ABSTRACT


This report gives a detailed treatment of the use of linear
prediction in speech analysis. New concepts are developed and
more familiar concepts are seen in a new way. The Covariance
and Autocorrelation methods are derived in the time and frequency
domains. Both methods are shown to be derivable from a more
general concept, that of generalized analysis-by-synthesis, where a
nonstationary two-dimensional spectrum is approximated by another
model spectrum. Linear prediction analysis is a special case
where the model spectrum is all-pole. Also, under the assumption
of stationarity the general Covariance method reduces to the
Autocorrelation method. The normalized error is defined. Its
relation to the cepstral zero quefrency, its usefulness as a
voicing detector and as a determiner of the optimum number of
predictor coefficients are discussed. The application of linear
prediction to pitch extraction and formant analysis is carefully
examined. Specific issues discussed include the adequacy of an
all-pole model for formant extraction, pitch-synchronous and
pitch-asynchronous analysis, windowing, preemphasis, and formant
extraction by peak picking.
CHAPTER I
INTRODUCTION

1.1 Historical Overview

One of the most important methods of speech analysis has
been the use of the short-time spectrum. This has been accomplished
in different ways and to different ends during the past
25 years. The first major breakthrough was the invention of the
sound spectrograph (Koenig, Dunn and Lacey, 1946), which is still
used extensively for the spectral analysis of speech. In 1960,
G. Fant published the classic Acoustic Theory of Speech Production,
which laid the foundations for many of the different methods of
speech analysis that followed. As a direct result of the significant
advances that occurred in understanding the acoustics of
speech production, and with the aid of high-speed digital computers,
the method of analysis-by-synthesis was given new impetus
at M.I.T. (Bell, Fujisaki, Heinz, Stevens and House, 1961). A
bank of 36 band-pass filters was used in their analysis. Another
landmark was the pitch-synchronous analysis of voiced sounds as
reported by Mathews, Miller and David (1961) at Bell Labs. They
actually used analysis-by-synthesis on the spectrum of a single
pitch period obtained by a Fourier analysis of the sampled waveform.
In 1964, A.M. Noll introduced the cepstrum for the purpose
of pitch extraction. The cepstrum was later used as the basis for
a formant tracking system (Schafer and Rabiner, 1970). This very
brief review gives a representative sample of the ideas and
methodologies that have had a definite effect on the types of speech
analysis that many speech researchers have chosen to pursue. A
more complete review can be found in Flanagan (1972).

1.2  Linear  Prediction 

The past two years have witnessed a surge of interest on the
part of the speech community in a method of analysis known
alternately as predictive coding, linear prediction, Prony's method,
inverse filtering formulation, etc. This surge of interest has
also been accompanied by an air of confusion. Two main reasons
for this confusion are:

(1) A lack of exposition on the similarities and differences
between different formulations.

(2) A resurfacing of some of the problems (e.g. windowing,
preemphasis, etc.) associated with accepted methods for
computation of short-time spectra.

We shall attempt, in this report, to deal with these problems
by relating a few of these formulations to each other.

Let  us  first  discuss  what  these  formulations  have  in  common. 
As  far  as  we  can  ascertain,  all  the  methods  we  have  inspected  have 
exactly  one  thing  in  common:  they  all  assume  that  at  a  particular 
instant  in  time,  a  speech  sample  s(nT)  can  be  approximated  by  a 
linearly  weighted  summation  of  the  past  p  samples,  where  p  is 
some integer:

$$s(nT) \approx \sum_{k=1}^{p} a_k \, s(nT - kT)$$

or

$$s_n \approx \sum_{k=1}^{p} a_k \, s_{n-k}, \tag{1-1}$$

where T is the sampling interval, n is the sample number, and $a_k$,
$1 \le k \le p$, are the weights. Equivalently, given p samples of a speech
signal, the following sample can be predicted approximately by a
linear summation of the p known samples. Hence the term "linear
prediction". Henceforth we shall use the term "linear prediction"
as a generic name for any method that makes an assumption
equivalent to that in (1-1).

The problem at hand, as put forth by linear prediction, is
to compute a set of predictor coefficients $a_k$ such that (1-1)
holds optimally over a specified period of time. It is in computing
the set of coefficients $a_k$ that different formulations of
linear prediction have evolved.

The  assumption  in  (1-1)  could  be  made  for  any  signal,  be  it 
speech  or  not.  The  reason  that  this  assumption  works  well  for 
speech  is  that  it  is  based  on  a  model  of  speech  production  which 
has  been  shown  to  work  quite  well  in  analysis-synthesis  systems 
(Fant,  1960).  Basically,  the  model  assumes  an  all-pole  transfer 
function of the combined effects of the glottal source, the
vocal tract and radiation. These poles can be computed by
solving a polynomial in z with coefficients $a_k$. A more detailed
description  of  this  model  is  given  in  Chapter  II. 

Theoretically there exist an unlimited number of ways in
which to compute the coefficients $a_k$. However, we shall initially
limit our discussion to three formulations which we feel to be
representative of the possible methods of analysis, and which
raise some interesting issues. We shall describe briefly each of
the formulations and give representative references on each without
attempting to give a complete bibliography. The three methods
will be given mnemonic names for ease of reference.

Exact  Method 

This  method  assumes  that: 

(a)  The  signal  is  defined  for  exactly  2p  consecutive  values. 

(b)  A  speech  sample  can  be  predicted  exactly  from  the  past 
p  samples,  and  that 

(c) This holds for the trailing p consecutive samples.

These assumptions are represented by the following set of equations:

$$\sum_{k=1}^{p} a_k \, s_{n-k} = s_n . \tag{1-2}$$

Written for each of the trailing p samples, these are p equations in
p unknowns which in general can be solved for the coefficients
$a_k$, $1 \le k \le p$.
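
As a small illustration of the Exact method, the sketch below solves (1-2) for the special case p = 2 using Cramer's rule. The indexing (four samples s_0 to s_3, with the equations written for n = 2 and n = 3), the function name, and the test values are assumptions made for this example only. The second call adds a single excitation pulse inside the analysis interval to show how strongly the exact solution can react when the samples are not perfectly predictable.

```python
def exact_method_order_two(s):
    """Solve the two exact-prediction equations for (a_1, a_2) from four samples.

    Assumed indexing: s = [s_0, s_1, s_2, s_3], with (1-2) written for
    n = 2 and n = 3:
        a_1*s_1 + a_2*s_0 = s_2
        a_1*s_2 + a_2*s_1 = s_3
    """
    det = s[1] * s[1] - s[0] * s[2]
    if abs(det) < 1e-12:
        raise ValueError("samples do not determine the coefficients")
    a1 = (s[2] * s[1] - s[0] * s[3]) / det
    a2 = (s[1] * s[3] - s[2] * s[2]) / det
    return a1, a2


# Zero-input samples obeying s_n = 1.2 s_{n-1} - 0.8 s_{n-2} (hypothetical values)
clean = [1.0, 0.5]
for _ in range(2):
    clean.append(1.2 * clean[-1] - 0.8 * clean[-2])
a_clean = exact_method_order_two(clean)

# Same samples, but with an excitation pulse added at the last sample
pulsed = clean[:3] + [clean[3] + 0.5]
a_pulsed = exact_method_order_two(pulsed)

print("clean :", a_clean)
print("pulsed:", a_pulsed)
```

With the clean samples the generating coefficients are recovered; with the pulsed samples the two equations are still satisfied exactly, but by quite different coefficients.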

Covariance  Method 

This  method  assumes  that: 

(a)  The  signal  is  defined  for  p+N  consecutive  values, 
where  N  is  some  integer. 

(b)  A  speech  sample  can  be  approximately  predicted  from 
the  past  p  samples,  and  that 

(c)  This  holds  for  the  trailing  N  consecutive  samples. 

(d) The total-squared error between the real signal and
its predicted value is minimized over the N consecutive
samples. (Some prefer to use the mean-squared
error instead of total-squared error. The difference
in this case is a division by a constant N which does
not affect the results of minimization.)

The minimization of error results in the following set of equations
(detailed derivation is shown in Section 3.1):

$$\sum_{k=1}^{p} a_k \, \phi_{ik} = \phi_{i0}, \qquad i = 1, 2, \ldots, p, \tag{1-3}$$

where

$$\phi_{ik} = \sum_{n=0}^{N-1} s_{n-i} \, s_{n-k} . \tag{1-4}$$

Again we have p equations in p unknowns which can be solved to
obtain the coefficients $a_k$, $1 \le k \le p$. The coefficients $\phi_{ik}$ form a
covariance matrix, hence the name "Covariance Method." Equations
such as (1-3) are known in least-squares terminology as the
normal equations of the process (Hildebrand, 1956, p. 260). In
this case we shall call (1-3) the Covariance normal equations,
or alternately the Covariance normal matrix equation.
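
The Covariance normal equations can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the samples are stored in a list whose index p corresponds to n = 0 in (1-4), and the system is solved by plain Gaussian elimination, which is a generic choice and not necessarily the solution method used elsewhere in this report.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Work on an augmented copy so the caller's data is not modified.
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("normal equations are singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for j in range(col, n + 1):
                aug[row][j] -= factor * aug[col][j]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][j] * x[j] for j in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def covariance_lpc(samples, p):
    """Return [a_1, ..., a_p] from the Covariance normal equations (1-3).

    `samples` holds p + N values; list index p corresponds to n = 0 in (1-4).
    """
    num = len(samples) - p  # N, the number of samples whose error is minimized
    if num < p:
        raise ValueError("need at least 2p samples")

    def phi(i, k):
        # (1-4): phi_ik = sum over n = 0..N-1 of s_{n-i} * s_{n-k}
        return sum(samples[p + n - i] * samples[p + n - k] for n in range(num))

    matrix = [[phi(i, k) for k in range(1, p + 1)] for i in range(1, p + 1)]
    rhs = [phi(i, 0) for i in range(1, p + 1)]
    return solve_linear_system(matrix, rhs)


# Noise-free test signal obeying s_n = 1.2 s_{n-1} - 0.8 s_{n-2} (hypothetical values)
signal = [1.0, 0.5]
for _ in range(20):
    signal.append(1.2 * signal[-1] - 0.8 * signal[-2])

coeffs = covariance_lpc(signal, p=2)
print([round(a, 6) for a in coeffs])
```

Because the test signal is exactly predictable from its past two samples, the minimized error is zero and the least-squares solution coincides with the generating coefficients.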

Autocorrelation  Method 

The  assumptions  made  in  this  method  are: 

(a)  The  signal  is  defined  for  all  time  such  that  it  is 
identically  zero  outside  a  portion  of  the  signal  N 
samples  long,  where  N  is  some  integer.  This  is 
equivalent  to  multiplying  the  speech  signal  by  a 
finite  window  of  length  N. 

(b)  Each  sample  can  be  approximately  predicted  from  the 
past  p  samples,  and  that 

(c)  This  is  true  for  all  time. 

(d)  The  total-squared  error  between  the  actual  signal 
and  its  predicted  value  is  minimized  for  all  time. 

The minimization of error results in the following set of equations
(the derivation is given in Section 3.1):

$$\sum_{k=1}^{p} a_k \, R_{|i-k|} = R_i, \qquad i = 1, 2, \ldots, p, \tag{1-5}$$

where

$$R_i = \sum_{n=0}^{N-1-|i|} s_n \, s_{n+|i|} . \tag{1-6}$$

Again (1-5) forms p equations with p unknowns to be solved for
the coefficients $a_k$.

The $R_i$ are autocorrelation coefficients of the signal. The
coefficients $R_{|i-k|}$ form a special matrix which we shall call the
autocorrelation matrix (as opposed to the covariance matrix in
the Covariance method). Also, we shall call equations (1-5) the
Autocorrelation normal equations or alternately the Autocorrelation
normal matrix equation.
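
The following sketch computes the autocorrelation coefficients of (1-6) and solves (1-5). Because the matrix of $R_{|i-k|}$ is symmetric with constant diagonals, the example uses the Levinson-Durbin recursion, a standard technique for such systems; that choice is an assumption of this example and is not taken from the text above. The function names and the test frame (a truncated damped cosine) are likewise hypothetical.

```python
import math


def autocorrelation(samples, max_lag):
    """(1-6): R_i = sum over n = 0..N-1-i of s_n * s_{n+i}, for i = 0..max_lag."""
    count = len(samples)
    return [
        sum(samples[n] * samples[n + lag] for n in range(count - lag))
        for lag in range(max_lag + 1)
    ]


def levinson_durbin(r, p):
    """Solve (1-5) for [a_1, ..., a_p]; also return the final squared error."""
    a = [0.0] * (p + 1)  # a[0] is unused so that a[k] matches a_k
    error = r[0]
    for i in range(1, p + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / error          # reflection coefficient for order i
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        error *= 1.0 - k * k     # error shrinks (or stays equal) at each order
    return a[1:], error


# Assumed test frame: 200 samples of a damped cosine, zero outside the frame
frame = [0.97 ** n * math.cos(0.3 * math.pi * n) for n in range(200)]
p = 4
r = autocorrelation(frame, p)
coeffs, err = levinson_durbin(r, p)

# Residual of each normal equation in (1-5); all should be near zero
residuals = [
    sum(coeffs[k - 1] * r[abs(i - k)] for k in range(1, p + 1)) - r[i]
    for i in range(1, p + 1)
]
print(coeffs)
print(max(abs(x) for x in residuals))
```

Treating the frame as zero outside its 200 samples corresponds to assumption (a) with a rectangular window; a tapered window could be applied to `frame` before calling `autocorrelation`.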

As we shall see in Chapter IV, there are other possible formulations
for the Covariance and Autocorrelation methods. The
assumptions made above do not all apply in the other formulations.
However, all Covariance-type formulations have (1-3) in common,
and all Autocorrelation-type formulations have (1-5) in common,
but (1-4) and (1-6) will not necessarily apply.

This concludes our brief description of each of three formulations
for linear prediction. Now, we shall relate the work of
some researchers to these three methods. The so-called Prony's
method (Hildebrand, 1956, p. 378) or the exponential approximation
method is equivalent to the Exact method for N = p and to the
Covariance method for N > p. A paper by Atal and Hanauer (1971),
which deals comprehensively with applications of linear prediction
in speech analysis and synthesis, makes use of the Covariance
method. The Autocorrelation method can be traced back to the
classic work by Wiener on linear prediction (Wiener, 1966).
Itakura and Saito (1970), using a maximum-likelihood method with
a statistical model of speech production, derive a formulation
which is equivalent to the Autocorrelation method. The digital
inverse filtering formulation given by Markel (1972) is also
equivalent to the Autocorrelation method. Markel's report contains
early references on the subject and explores formant tracking as
an application. Weinstein and Oppenheim (1971) have used linear
prediction in a homomorphic vocoder, and it seems from their
paper that they used the Autocorrelation method also.

It should be pointed out that linear prediction has had extensive
applications in other fields. For example, Flinn (1972)
gives references on seismic and acoustic applications. We quote
from the introduction to the special issue on the M.I.T. Geophysical
Analysis Group Reports in Geophysics (Treitel and Robinson,
1967):

"The applications (of predictive decomposition) to
seismic exploration deal with the model in which
a section of seismic trace is given as the convolution
of a random spike series with a minimum-delay
waveform."

As we shall see, the problem in the analysis of voiced speech
is very similar, except that instead of a random spike series (i.e.
impulses) we have a quasi-periodic impulse series. These seismic
applications have used the Autocorrelation method of linear
prediction.

In this report we shall investigate in detail the properties
of the Autocorrelation and Covariance methods of linear prediction.
The Exact method will not be discussed in any detail because
it does not seem to have wide applicability in speech analysis
(see Section 2.2). Of all three methods of linear prediction,
we believe that the Autocorrelation method gives the speech
researcher a more intuitive feel for the properties of linear
prediction in terms of traditional concepts such as Fourier
transformation and analysis-by-synthesis. On the other hand, the
Covariance method offers new and exciting possibilities in the
analysis of speech as a nonstationary signal.

1.3 Chapter Summaries

Basic to the workings of linear prediction in speech analysis
is an appreciation for the underlying speech production model.
The all-pole discrete model is described in Chapter II, with a
critical evaluation of its adequacy for different applications
of speech analysis. The main parameters of the model are the
predictor coefficients. These coefficients can be computed from
the speech signal by one of the methods of linear prediction.
The time-domain derivation of the Covariance and Autocorrelation
methods and methods of computing the predictor coefficients are
the subject of Chapter III. The stability of the resulting linear
predictor is also discussed.

Although linear prediction has become popular as a time-domain
analysis, we show in Chapter IV that linear prediction can
be considered equally validly, and perhaps better understood, as
a frequency-domain analysis. (In reality, linear prediction is
an autocorrelation-domain analysis, which can be approached either
from the time or frequency domain.) The formulations for the
Covariance and Autocorrelation methods given in Section 1.2 are
shown to be special cases of more general formulations. We
introduce the concept of generalized analysis-by-synthesis where
the 2D-spectrum (two-dimensional spectrum) of a nonstationary
signal (i.e. its statistics change with time) is to be approximated
by another 2D-spectrum, where the error to be minimized is
proportional to the integral of the ratio of the original spectrum
to the approximate spectrum. In the special case when the
approximate spectrum is all-pole, the generalized method reduces
to the general Covariance method of linear prediction. If, in
addition, the signal is assumed to be stationary, the Covariance
method reduces to the Autocorrelation method. The general
Covariance and Autocorrelation methods thus derived are each divided
further into a direct and an indirect method, depending on whether
the autocorrelation coefficients are computed from an infinite
but windowed signal, or from a finite and unwindowed portion of
the signal, respectively. The formulations given in Section 1.2
are then relabelled as the indirect Covariance and direct
Autocorrelation methods.

In order to better understand the manner in which linear
prediction operates, we analyze in Chapter V one of the methods in
detail, namely the direct Autocorrelation method. We examine
the manner in which the all-pole spectrum approximates the signal
spectrum, and the relation between the all-pole transfer function
and the signal transfer function, especially as the number of
poles is increased indefinitely. The remainder of the chapter is
devoted to a detailed analysis of the normalized error, its
relation to the zero quefrency (zero coefficient of the transform
of the log spectrum), and its possible usefulness as a voicing
detector and as a determiner of the optimum number of predictor
coefficients to be used for certain applications.

Finally, in Chapter VI, we study how linear prediction can
be useful in pitch extraction and formant analysis. Specific
issues discussed include the adequacy of an all-pole model for
formant extraction, pitch-synchronous and pitch-asynchronous
analysis, windowing, preemphasis, and formant extraction by peak picking.

In this report we have attempted to be as analytical as possible,
but without losing sight of the applied world. The theory
is seen as a solid basis on which to build a better understanding
of how best to apply linear prediction to the analysis of speech.
Thus, instead of flooding the reader with examples of when a
particular method works, we have analyzed in detail situations
where that method fails, in order to give a better appreciation
of the processes involved.
CHAPTER II

DISCRETE MODEL OF SPEECH PRODUCTION

We mentioned in Section 1.2 that the reason linear prediction
works well in the analysis of the speech signal is that it
is based on a model of speech production which agrees, to a
large extent, with existing theories of speech production (such
as Fant, 1960), and which has proven to be a good practical model
in speech synthesis. Here we shall describe this model of speech
production (in the discrete domain) and relate it to the three
methods of linear prediction described in Section 1.2.

2.1  Speech  Production  Model 


Speech is produced as a result of the excitation of a time-varying
vocal tract shape. The speech signal is in general a
nonstationary process, i.e. its statistics change with time. The
nonstationarity is a result of changes in the excitation as well
as in the vocal tract shape. If both the excitation and the vocal
tract shape remain fixed, the resulting speech signal can be
considered to be stationary. For example, uttering the vowel [a]
at a constant pitch and intensity level produces a signal that is
stationary. Keeping the vocal tract shape fixed for [a] and changing
the pitch with time (such as going up a musical scale) produces
a signal that is nonstationary. In general, given that
some process is the output of a linear system, the process is
stationary if the system is time-invariant and the input (or
excitation) is stationary. If either the input is nonstationary or
the system is time-varying, or both, the output process is
nonstationary. The importance of the question of stationarity of
the speech signal will become evident later.

For the purposes of modeling speech production, we approximate
the continuously-varying vocal tract shape by a discretely-varying
vocal tract shape, i.e. a vocal tract whose shape changes
at discrete time intervals. Such a time interval shall be called
a "frame". Within a frame, the vocal tract shape is considered
to be fixed and can be modeled by a linear time-invariant filter.
This model of speech production has been used effectively in
speech synthesis systems. In linear prediction the linear filter
is restricted to be all-pole.

Thus, the model of speech production used in linear prediction
consists of the following three assumptions:

(1) Within a short interval of time (on the order of
10-25 msec) the human vocal tract is assumed to be fixed in shape.
We shall refer to such an interval as a "frame".

(2) Within any frame, we assume that the transfer function
of the combined effects of the glottal flow, the vocal tract
(including the oral and nasal cavities) and the radiation characteristic,
can be modeled by a linear time-invariant all-pole filter with
either a sequence of impulses or white noise (or a combination
of both) as input (see Fig. 2-1).

(3) The speech signal can be considered as the output of
such an all-pole filter whose coefficients change at discrete
intervals of time (on the order of 10 msec).

Below we shall focus our attention on a single frame where
the all-pole filter is assumed to be time-invariant. Fig. 2-1a
shows a schematic of the model in the frequency domain. The
complex variable z is defined by:

$$z = e^{sT} = e^{(\sigma + j\omega)T}$$

where $s = \sigma + j\omega$ is the Laplace operator,
$\omega = 2\pi f$ is the radian frequency in rad/sec,
$\sigma$ is the damping factor in rad/sec,
$T = 1/f_s$ is the sampling interval in seconds,
and $f_s$ is the sampling frequency in Hz.

(A brief presentation of z-transforms and their interpretation
in terms of traditional Fourier series is given in Appendix A.)
Figure 2-1a is interpreted as follows: Speech is either voiced,
fricated, or both. (Throughout this report we shall assume that
aspiration is a kind of frication.) Voiced speech is produced
by applying a sequence of impulses, spaced at the pitch period,
to a digital filter of the form:
$$\hat{S}(z) = \frac{A}{H(z)} = \frac{A}{1 - \sum_{k=1}^{p} a_k z^{-k}}, \tag{2-2}$$

where $a_k$, $1 \le k \le p$, are the filter coefficients,
A is a multiplicative gain factor that controls the signal
amplitude, and

$$H(z) = 1 - \sum_{k=1}^{p} a_k z^{-k} \tag{2-3}$$

is the inverse filter.

The output of the filter $\hat{S}(z)$ is s(nT), the speech samples.
Fricated speech is produced by applying a sequence of white noise
samples, spaced T seconds apart, to a filter of the form $\hat{S}(z)$.
Voiced fricatives are produced by a combination of voicing and
frication. The filter $\hat{S}(z)$ represents the combined transfer
function of the glottal flow, the vocal tract and radiation. The poles
of the filter $\hat{S}(z)$ can be determined by solving for the roots of
the polynomial in z in the denominator of $\hat{S}(z)$.
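
To make the root-solving step concrete, the sketch below handles the smallest nontrivial case, p = 2, where the denominator polynomial can be solved with the quadratic formula. Each pole is then mapped back to a damping factor and a frequency by inverting the definition $z = e^{(\sigma + j\omega)T}$ given above. The function names, the 10 kHz sampling rate, and the pole placed at 500 Hz are all assumptions chosen for illustration; a general-order analysis would need a numerical polynomial root finder instead.

```python
import cmath
import math


def poles_order_two(a1, a2):
    """Roots of z^2 - a1*z - a2 = 0, the poles of A / (1 - a1 z^-1 - a2 z^-2)."""
    disc = cmath.sqrt(a1 * a1 + 4.0 * a2)
    return (a1 + disc) / 2.0, (a1 - disc) / 2.0


def pole_to_s_plane(z, sampling_rate_hz):
    """Invert z = exp((sigma + j*omega) * T), with T = 1 / sampling_rate_hz."""
    t = 1.0 / sampling_rate_hz
    sigma = math.log(abs(z)) / t                      # damping factor
    freq_hz = cmath.phase(z) / (2.0 * math.pi * t)    # f = omega / (2*pi)
    return sigma, freq_hz


# Hypothetical coefficients: one resonance at 500 Hz, pole radius 0.95, fs = 10 kHz
fs = 10000.0
radius, target_hz = 0.95, 500.0
a1 = 2.0 * radius * math.cos(2.0 * math.pi * target_hz / fs)
a2 = -radius * radius

z_upper, z_lower = poles_order_two(a1, a2)
sigma, freq = pole_to_s_plane(z_upper, fs)
print("poles:", z_upper, z_lower)
print("frequency (Hz):", freq, "damping (rad/sec):", sigma)
```

A pole inside the unit circle gives a negative damping factor, i.e. a decaying resonance; the two poles of a real-coefficient filter form a complex-conjugate pair.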


Representing the z-transforms of s(nT) and u(nT) by S(z)
and U(z), respectively, we can write from Fig. 2-1a:

$$S(z) = U(z)\,\hat{S}(z) = \frac{A\,U(z)}{1 - \sum_{k=1}^{p} a_k z^{-k}} . \tag{2-4}$$

Equation (2-4) can be rewritten as:

$$S(z) = S(z) \sum_{k=1}^{p} a_k z^{-k} + A\,U(z) . \tag{2-5}$$

Taking the inverse z-transform of (2-5) we obtain:

$$s(nT) = \sum_{k=1}^{p} a_k \, s(nT - kT) + A\,u(nT)$$

or

$$s_n = \sum_{k=1}^{p} a_k \, s_{n-k} + A\,u_n , \tag{2-6}$$

where T, the sampling interval, has been omitted in (2-6) but
is still implied.

Equation (2-6) is the time-domain counterpart to (2-4), and it
represents the speech production model in the discrete time
domain. A schematic of the time-domain model is shown in Fig. 2-1b.
It should be clear that the systems in Figs. 2-1a and 2-1b are
equivalent.
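
The difference equation (2-6) translates directly into a short synthesis loop. The sketch below is a minimal illustration: the function names, the two filter coefficients, the gain, and the pitch period of 80 samples are hypothetical, and the signal is assumed to be zero before the first sample.

```python
def synthesize(coeffs, gain, excitation):
    """Apply (2-6): s_n = sum_{k=1..p} a_k * s_{n-k} + A * u_n, with s_n = 0 for n < 0."""
    p = len(coeffs)
    output = []
    for n, u_n in enumerate(excitation):
        predicted = sum(
            coeffs[k - 1] * output[n - k] for k in range(1, p + 1) if n - k >= 0
        )
        output.append(predicted + gain * u_n)
    return output


def impulse_train(length, period):
    """Unit impulses every `period` samples, a stand-in for voiced excitation."""
    return [1.0 if n % period == 0 else 0.0 for n in range(length)]


# Hypothetical two-pole filter and a pitch period of 80 samples
coeffs = [1.2, -0.8]
gain = 0.5
u = impulse_train(length=240, period=80)
s = synthesize(coeffs, gain, u)

# Between impulses the input is zero, so a sample equals the weighted sum
# of the previous p samples; at an impulse the two differ by A * u_n.
for n in (10, 80):
    prediction = coeffs[0] * s[n - 1] + coeffs[1] * s[n - 2]
    print(n, s[n] - prediction)
```

Replacing the impulse train with a sequence of random values would give a corresponding sketch of fricated excitation.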

2.2  Use  of  the  Model  in  Linear  Prediction 


Note from (2-6) that except for contributions by the input
u(nT), the signal s(nT) is produced by a linear summation of
the past p samples. In trying to fit the model of Fig. 2-1 to a
real speech signal we encounter the problem of not knowing what
the input signal u(nT) looks like. For example, we don't know


Report  No.  2304  Bolt  Beranek  and  Newman  Inc. 

a priori whether the speech signal is voiced or unvoiced. Even
if we know that the signal s(nT) is likely to be voiced, we do
not know the exact times of occurrence of the impulses in u(nT).
Therefore, in linear prediction we first let u(nT) be an unknown
(actually, the Exact method described in Section 1.2 assumes that
u(nT) = 0) and assume that (1-1) holds, i.e. we assume that s(nT)
can be approximated by a linear summation of the past p samples.
After the determination of the coefficients $a_k$, $1 \le k \le p$, we can
then determine A by energy considerations, and we can also make
certain statements about u(nT). (Normally, u(nT) is of interest
only for voiced sounds since it gives information concerning the
periodicity (pitch) of the speech signal.) Indeed, after some
knowledge of the position of the pitch pulses in time, one could
use that information to get a better estimate of the coefficients
$a_k$.

As mentioned above, the Exact method of linear prediction
assumes that u(nT) = 0 for all n. In general, this is not a good
assumption for speech unless one is sure, for example, that there
are no pitch pulses (in a voiced segment) during the time interval
corresponding to the 2p speech samples needed for the analysis.
For this reason one does not expect very good results using the
Exact method of analysis. We know of no researcher who has used
this method to analyze speech in any extensive manner.
On the other hand, both the Covariance and the Autocorrelation
methods of analysis (see Section 1.2) admit that linear
prediction produces an error which they proceed to minimize in
the least-squares sense. The difference between the two methods
lies in the definition of what the signal is and in the region
of error minimization. This difference can be interpreted in
terms of the stationarity of the speech signal. In the speech
production model given in Section 2.1 the vocal tract was modeled
by a linear time-invariant system for a single frame of speech.
Within that frame, the signal s(nT) in Fig. 2-1 can still be
either stationary or nonstationary depending on the input u(nT).
As we shall see in Chapter IV, the Autocorrelation method assumes
the signal s(nT) to be stationary, while the Covariance method
assumes the signal to be nonstationary within a single frame.

2.3 Adequacy of the Model

We have mentioned that methods of linear prediction implicitly
rely on the all-pole model of the vocal tract, glottal
flow and radiation. The question is to what extent this model
is adequate and for what applications. We shall compare this
model with standard models of speech production described in
Fant (1960) and Flanagan (1965).

For nonnasal sonorant sounds, the transfer function of the
vocal tract is generally known to have only poles (resonances)
and no zeros (antiresonances). Therefore, for these sounds an
all-pole model of the vocal tract is adequate. On the other
hand, for nasal and fricative sounds the transfer function of
the vocal tract is considered to have zeros as well as poles.
This means that the zeros are being approximated by poles in the
linear prediction model. Now, these zeros lie within the unit
circle in the z-plane (Atal and Hanauer, 1971, p. 638), and each
zero can be replaced theoretically by an infinity of poles. This
is done by noting that a zero $(1 - az^{-1})$ inside the unit circle
(i.e. $|a| < 1$) can be expanded (by long division into 1) as:

$$1 - az^{-1} = \frac{1}{1 + az^{-1} + a^2 z^{-2} + \cdots} . \tag{2-7}$$
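
A quick numerical check of (2-7) shows how truncating the series to a finite number of poles approximates a zero. The sketch below evaluates both sides on the unit circle and reports the largest difference for several truncation lengths. The zero location a = 0.8, the frequency grid, and the function names are assumptions chosen only for illustration.

```python
import cmath
import math


def zero_response(a, omega):
    """Evaluate 1 - a*z^-1 on the unit circle, z = exp(j*omega)."""
    return 1.0 - a * cmath.exp(-1j * omega)


def all_pole_response(a, omega, num_poles):
    """Evaluate 1 / (1 + a z^-1 + ... + a^m z^-m), with m = num_poles."""
    z_inv = cmath.exp(-1j * omega)
    denom = sum((a * z_inv) ** k for k in range(num_poles + 1))
    return 1.0 / denom


def worst_error(a, num_poles, grid=256):
    """Largest magnitude difference over a grid of frequencies in [0, pi]."""
    return max(
        abs(zero_response(a, w) - all_pole_response(a, w, num_poles))
        for w in (math.pi * i / grid for i in range(grid + 1))
    )


a = 0.8  # hypothetical zero location, inside the unit circle
errors = {m: worst_error(a, m) for m in (2, 5, 10, 20, 40)}
for m, e in errors.items():
    print(m, e)
```

The error shrinks roughly in proportion to $a^{m+1}$, so a zero close to the unit circle needs many more poles for the same accuracy than one near the origin.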

Now,  one  could  argue  that  the  effect  of  a  zero  can  be  approxi¬ 
mated  by  a  finite  number  of  poles  and,  hence,  an  all-pole  model 
would  also  be  adequate  for  nasal  and  fricative  sounds.  However, 
it is not clear how the poles that are approximating the zeros
interact with the genuine poles (formants). What is likely to
happen is that in trying to apply the all-pole model to nasals
and fricatives, the antiresonances in those sounds will have the
effect of shifting the positions and bandwidths of the formants
as computed from the model. (This effect is discussed in
Section 6.2.) For example, consider a particular all-pole transfer
function (computed by some linear prediction method) which
approximates that of the vocal tract for, say, a nasal. Not only is
it unclear how one would go about locating the zeros (if any), but
the computed positions and bandwidths of formants close to those
zeros will be different from the "actual" values. In other words,
if one is interested in locating the positions of the antiformants
as well as the formants in a nasal or fricative, then linear
prediction may not be adequate. This can be important for
applications such as speech recognition. On the other hand, if one is
interested in using the results of the analysis for speech
synthesis then the all-pole model is quite adequate. The reason for
this lies partly in the fact that the human perceptual system is
much more sensitive to the location of a pole than to the
location of a zero (Matsuda, 1966; Flanagan, 1965, p. 215). Another
reason may be that the human ear is sensitive to the general
envelope of the spectrum, and it does not matter in what manner
that spectrum was generated. As we shall see in Chapter IV, linear
prediction guarantees a good spectral envelope fit to a
short-time spectrum. Speech synthesizers that have used all-pole
filters to generate sounds that normally contain zeros show that an
all-pole model is quite adequate for speech production (Schafer
and Rabiner, 1970; Atal and Hanauer, 1971; Klatt, 1972), although
Mermelstein (1972) reports that an all-pole formulation
introduces a noticeable decrease in naturalness. (The adequacy of an
all-pole model for the purpose of speech recognition will be
discussed in Section 6.2.)

There remain the effects of radiation and glottal pulse
shape. The effect of the radiation at the mouth and nostrils can
be approximated by a zero at d.c. (Flanagan, 1965, p. 33), or in
z-transform notation: $(1 - z^{-1})$. The spectrum of the glottal
volume velocity is characterized by a large number of zeros (Flanagan,
1965, p. 44; Mathews et al., 1961), but the general shape of the
glottal spectrum can be approximated by two or three poles.
Martony (1965) found that the slope of the glottal spectrum
between 500 and 3000 Hz varies between -12 and -18 dB/octave, depending
on the individual. The net effect of the zero due to radiation
and one of the poles approximating the glottal source can be
approximated (in the z-plane) by a pole on the negative real axis
inside the unit circle (Atal and Hanauer, 1971). (The effect
on the spectrum of such a pole is described in Appendix A.)
Hence, roughly speaking, the combined effects of radiation and
glottal source can be approximated by two or three poles.
Therefore, the linear prediction model seems to be adequate. It
should be noted that the perceptual effect due to the glottal
source is generally associated with the naturalness of speech
and the characteristics of the speaker. Its effect on the
identification of speech sounds does not seem to be of major importance
(Flanagan, 1965, p. 199).

2.4  Determination of the Number of Poles p

In the linear prediction model of speech production shown
in Fig. 2-1 the transfer function is assumed to have a certain
number of poles p. Ideally, the value of p should change from
one speech frame to another depending on the number of poles
needed to represent each sound. In order to get an idea of the
order of magnitude of p we shall take a specific example.

Generally for males, the average number of formants in a
5 kHz bandwidth is five. For example, for the sound [a] the vocal
(oral) tract can be approximated by a tube open at one end and
closed at the other. If the length of the tract is 17 cm then the
natural resonances of the tube will occur at $F_n = (2n-1)c/4L$,
n = 1, 2, ..., where c = 340 meters/sec is the velocity of sound in
air, and L = 17 cm is the vocal tract length. Therefore in a 5 kHz
region we have the five formants 500, 1500, 2500, 3500, and 4500 Hz.
Since each formant comprises a pair of complex conjugate poles, the
number of poles necessary to represent such a vocal tract is 10. [Atal
and Hanauer (1971, p. 630) derive the same number from a different
point of view.] Now, we mentioned in Section 2.3 that two or
three poles are adequate to represent the effects of the glottal
flow and radiation. Therefore, the value of p should be
approximately 12 or 13. However, we have so far neglected one other
factor which should have an effect on the value of p, and that is
the fact that the poles are realized digitally. This has
a side effect which is discussed below.
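
As a quick numerical check of the example above, the following minimal sketch lists the resonances of a uniform, lossless tube closed at one end and counts the poles they imply. The function name `tube_resonances` and its arguments are illustrative choices, not part of the report; the only physics assumed is that the resonances fall at odd multiples of c/4L.

```python
def tube_resonances(length_m, bandwidth_hz, c=340.0):
    """Resonances of a uniform lossless tube, closed at one end and open
    at the other: odd multiples of c / (4 L), kept up to bandwidth_hz."""
    base = c / (4.0 * length_m)
    freqs = []
    n = 1
    while (2 * n - 1) * base <= bandwidth_hz:
        freqs.append((2 * n - 1) * base)
        n += 1
    return freqs


formants = tube_resonances(length_m=0.17, bandwidth_hz=5000.0)
vocal_tract_poles = 2 * len(formants)   # one complex conjugate pair per formant
# Add the two or three poles attributed to glottal flow and radiation.
p_range = (vocal_tract_poles + 2, vocal_tract_poles + 3)
print([round(f) for f in formants], vocal_tract_poles, p_range)
```

For a 17 cm tract this gives five formants below 5 kHz, hence 10 vocal tract poles and a total of 12 or 13 before the digital-realization correction discussed next.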

Theoretically, the number of resonances of the vocal tract
is infinite. Analog formant synthesizers employing a fixed
number of formants (usually 5) must compensate for higher frequency
formants by what is known as the higher-pole correction (Fant, 1960).
However, this higher-pole correction is not necessary in digital
formant synthesizers because of the periodic frequency response
of a digital formant network (Gold and Rabiner, 1968). As a
result, the 10 poles necessary to represent the vocal tract transfer
function in a 5 kHz bandwidth can be realized digitally without
the need for compensation. On the other hand, the above reasoning
cannot be applied validly to digital implementation of the poles
representing the glottal flow and radiation. The periodicity of
the digital network response is equivalent to an aliasing effect
which can cause an error in the response of a single low-frequency
pole by as much as 4 dB at 5 kHz (see Appendix A). On the average,
the error is on the order of 2 dB at 5 kHz (Gold and Rabiner, 1968).
This is true for each of the two or three poles representing the
glottal flow and radiation. Therefore, in order to compensate
for this cumulative error one must introduce at least one extra
pole. The value of p now becomes approximately 13 to 14.

The above estimate for p assumes that the signal was
sampled at 10 kHz. For other sampling frequencies the value of p
is roughly equal to:

$$ p = 2N_f + N_r \tag{2-8} $$

where $N_f$ is the number of formants expected in a frequency range
equal to half the sampling frequency, and $N_r$ is the number of
real poles needed to represent the effects of the glottal flow
and radiation. We have seen above that $N_r$ is approximately equal
to 3 or 4, independent of the sampling frequency. For nonnasal
sonorants, formants occur at the rate of about one formant per
1 kHz of bandwidth (for male speakers). Therefore, (2-8) reduces
to:

$$ p = f_s\ (\text{kHz}) + N_r \quad \text{(nonnasal sonorants)} \tag{2-9} $$

where $f_s$ is the sampling frequency in kHz, and $N_r$ is equal to
3 or 4.
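
The two rules of thumb can be written down directly. The sketch below is only an illustration of (2-8) and (2-9); the function names are hypothetical, and the sampling rates in the loop are arbitrary examples rather than values taken from the report.

```python
def order_from_formants(n_formants, n_real):
    """Equation (2-8): two poles per formant plus n_real real poles."""
    return 2 * n_formants + n_real


def order_nonnasal(fs_khz, n_real):
    """Equation (2-9): about one formant per kHz of bandwidth (male speakers)."""
    return fs_khz + n_real


# Predictor order for a few example sampling rates, with N_r = 3 and N_r = 4.
for fs_khz in (8, 10, 16, 20):
    print(fs_khz, order_nonnasal(fs_khz, 3), order_nonnasal(fs_khz, 4))
```

At 10 kHz sampling both forms agree with the estimate of 13 to 14 poles obtained above.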

Equations (2-8) and (2-9) assume that the vocal tract can
be approximated adequately by a number of poles. In particular,
(2-9) is useful mainly for nonnasal sonorants. Other sounds,
such as nasals and fricatives, are best represented by a
combination of zeros and poles. Below, we shall discuss nasals as an
example of sounds with zeros as well as poles.

Nasal poles correspond to the resonances of the nasal tract,
while the zeros are due to the coupling to the mouth cavity. For
an uncoupled nasal tract, there are no zeros and the average
spacing of nasal formants is about 800 Hz for a male speaker.
(Compare this with 1000 Hz for vowels; the difference is due to the
fact that the nasal tract is longer than the oral tract.) These
formants usually have higher bandwidths than vowel formants
because of greater losses in the nasal cavity. From (2-8) we
conclude that the number of poles needed to represent the uncoupled
nasal system is approximately:

$$ p = 1.2 f_s\ (\text{kHz}) + N_r . \tag{2-10} $$


The velar nasal [ŋ] can be reasonably approximated by an uncoupled
nasal tract up to 5 kHz, and (2-10) would be applicable. On the
other hand, [m] and [n] have important antiformants in that
frequency range. Each antiformant causes one of the nasal formants
to split into two formants, thus forming what might be called a
"formant cluster" (Fujimura, 1962). A nasal formant cluster, then,
consists of two formants and one antiformant in the same region.
In the frequency range up to 3000 Hz, [ŋ] has four formants; [m]
is obtained when the second formant is replaced by a cluster
consisting of two formants and one antiformant, and [n] is obtained
when the third formant is replaced by a similar cluster (Fujimura,
1962). The position of the antiformant with respect to the two
formants in the cluster is quite variable, depending on the
speaker and the phonetic context. If every antiformant happened to
coincide with one of the two formants in its cluster, then (2-10)
would still apply. However, in general, that is not the case;
indeed the opposite is true. More importantly, a small shift in
the position of a zero with respect to neighboring poles has
drastic effects on the shape of the spectrum. This is important since
linear prediction is basically a spectral matching process.

In trying to estimate a theoretical value for p in the case
where zeros (or antiformants) exist, we attempted to approximate
a spectral antiformant (complex conjugate pair of zeros) by a
number of poles. We found that we needed at least 10 poles (10
kHz sampling) to get a rough spectral match to a single
antiformant that is typical for nasals and fricatives. This number
would have to be added to (2-10) in order to get a good estimate
for what p should be to represent a nasal whose zero does not
interact with neighboring poles. The number would have to be
decreased with increased interaction. In the limit when the zero
cancels a pole, (2-10) would apply as is. Since there is no a
priori way to determine the position of a zero with respect to
neighboring poles, there is no way of getting a good theoretical
estimate for p. However, practical estimates for p do exist
depending on the application. In Sections 5.6 and 6.2 we shall
argue that, although the "optimum" value for p depends on the
type of sound as well as the individual speaker, a suboptimal
value is usually adequate for many applications.

CHAPTER  III

LINEAR  PREDICTION  ANALYSIS

In this chapter we shall derive in the time domain the
Covariance and Autocorrelation normal equations (1-3) and (1-5) and
suggest algorithms for computing the predictor parameters. Given
the normal equations, the minimum squared error is defined. The
stability of the linear predictor, an important issue for speech
synthesis, will then be examined for the three formulations of
linear prediction. We then take a look at some
autocorrelation-domain properties of linear prediction. A method for the
computation of the gain factor A in $\hat{S}(z)$ will be specified.

3.1  Derivation of Covariance and Autocorrelation Normal Equations

Following the linear prediction speech production model
described in Section 2.1 and represented by (2-6), we shall assume
that a sampled speech signal s(nT) at time t=nT can be
approximately predicted by a linear weighted summation of the past p
samples. Let this approximation to s(nT) be $\hat{s}(nT)$. Writing
$s_n$ for s(nT), we have:

$$ \hat{s}_n = \sum_{k=1}^{p} a_k s_{n-k} , \tag{3-1} $$

where $a_k$, $1 \le k \le p$, is a set of real constants representing the
predictor coefficients, and p is some integer whose value is
determined as described in Sections 2.4 and 5.6.

Let the error between the actual value and the predicted value
be given by $e_n$, where:


$$ e_n = s_n - \hat{s}_n = s_n - \sum_{k=1}^{p} a_k s_{n-k} . \tag{3-2} $$

The problem is to find $a_k$, $1 \le k \le p$, such that the error $e_n$ is
minimized in some sense over the desired range of signal samples.
Both the Covariance and Autocorrelation methods employ a
least-squares minimization procedure since it leads to a mathematically
attractive solution. Denote the total-squared error by E,
defined as:

$$ E = \sum_n e_n^2 = \sum_n (s_n - \hat{s}_n)^2 . \tag{3-3} $$

The range over which the summation in (3-3) applies and the
definition of $s_n$ in that range are of importance. Indeed, this is
exactly where the difference between the Covariance and
Autocorrelation methods lies. However, let us first minimize E without
specification of the range of the summation. Substituting (3-1)
in (3-3) we obtain:

$$ E = \sum_n \Big( s_n - \sum_{k=1}^{p} a_k s_{n-k} \Big)^2 . \tag{3-4} $$

The problem reduces to finding the condition that minimizes the
total-squared error E with respect to $a_k$, $1 \le k \le p$. This condition
is obtained by setting to zero the partial derivative of E with
respect to each $a_i$:


$$ \frac{\partial E}{\partial a_i} = \sum_n 2 \Big( s_n - \sum_{k=1}^{p} a_k s_{n-k} \Big) (-s_{n-i}) = 0 , \tag{3-5} $$

or,

$$ \sum_n s_n s_{n-i} - \sum_n \sum_{k=1}^{p} a_k s_{n-k} s_{n-i} = 0 , \quad 1 \le i \le p . \tag{3-6} $$

Rearranging terms and interchanging summations we obtain:

$$ \sum_{k=1}^{p} a_k \sum_n s_{n-k} s_{n-i} = \sum_n s_n s_{n-i} , \quad 1 \le i \le p . \tag{3-7} $$

Equations (3-7) are known as the normal equations. For any
definition of the signal $s_n$, (3-7) forms a set of p equations with p
unknowns which can be solved for the predictor coefficients $a_k$.
Now, we shall derive the Covariance and Autocorrelation normal
equations from (3-7).


Covariance Normal Equations

Referring back to the assumptions of the Covariance method
in Section 1.2, the summation over n in (3-3) and hence in (3-7)
must go over N consecutive signal samples. Without loss of
generality, we let the range of summation over n be: n = 0, 1, ..., N-1.
We can now write (3-7) as:


$$ \sum_{k=1}^{p} a_k \phi_{ik} = \phi_{i0} , \tag{3-8} $$

where

$$ \phi_{ik} = \sum_{n=0}^{N-1} s_{n-i} s_{n-k} . \tag{3-9} $$

Note that (3-8) and (3-9) are identical to (1-3) and (1-4), and
the derivation of the Covariance normal equations is complete.
From (3-8) and (3-9) we note that values of $s_n$ for n = -p, ..., -1,
0, 1, ..., N-1 must be known. Therefore the signal $s_n$ must be
defined for p+N consecutive values, as stated in Section 1.2.
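
The Covariance formulation can be sketched in a few lines. The code below is an illustration under stated assumptions, not the report's own program: the function names are hypothetical, the p+N samples are assumed to be stored in a list whose first p entries are $s_{-p}, \ldots, s_{-1}$, and the equations are solved by plain Gaussian elimination with partial pivoting.

```python
def covariance_terms(s, p):
    """Return phi[i][k] of (3-9) for 0 <= i, k <= p.

    s holds p + N consecutive samples; list index j stores sample s_{j-p},
    so the summation index n = 0..N-1 maps to list index p + n.
    """
    n_sum = len(s) - p
    phi = [[0.0] * (p + 1) for _ in range(p + 1)]
    for i in range(p + 1):
        for k in range(i + 1):
            acc = sum(s[p + n - i] * s[p + n - k] for n in range(n_sum))
            phi[i][k] = phi[k][i] = acc          # phi is symmetric
    return phi


def solve_linear(matrix, rhs):
    """Gaussian elimination with partial pivoting on a small dense system."""
    size = len(rhs)
    m = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, size):
            factor = m[r][col] / m[col][col]
            for c in range(col, size + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * size
    for r in range(size - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, size))
        x[r] = (m[r][size] - tail) / m[r][r]
    return x


def covariance_predictor(s, p):
    """Solve (3-8) for a_1..a_p."""
    phi = covariance_terms(s, p)
    matrix = [[phi[i][k] for k in range(1, p + 1)] for i in range(1, p + 1)]
    rhs = [phi[i][0] for i in range(1, p + 1)]
    return solve_linear(matrix, rhs)


# Test signal: free response of s_n = 1.2 s_{n-1} - 0.8 s_{n-2}.
signal = [1.0, 0.5]
for _ in range(20):
    signal.append(1.2 * signal[-1] - 0.8 * signal[-2])
coeffs = covariance_predictor(signal, p=2)
print(coeffs)
```

Because the test signal obeys a second-order recursion exactly over the summation range, the prediction error can be driven to zero and the recovered coefficients coincide with the generating ones up to rounding.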


Autocorrelation Normal Equations

From the assumptions in Section 1.2 we can define the signal
$s_n$ as follows:

$$ s_n = \begin{cases} \text{some sampled signal}, & n = 0, 1, \ldots, N-1, \\ 0, & \text{otherwise}. \end{cases} \tag{3-10} $$

The windowed signal $s_n$ is defined for all n: $-\infty < n < +\infty$. Equation
(3-7) becomes:

$$ \sum_{k=1}^{p} a_k \sum_{n=-\infty}^{\infty} s_{n-k} s_{n-i} = \sum_{n=-\infty}^{\infty} s_n s_{n-i} , \quad 1 \le i \le p . \tag{3-11} $$

3.2  Computation of Predictor Parameters

In each of the three formulations of linear prediction
presented in Section 1.2 (eqs. 1-2, 3-8, 3-15), the predictor
coefficients $a_k$, $1 \le k \le p$, can be computed by solving a set of p
equations with p unknowns. There exist several standard methods for
performing the necessary computations, e.g. the Gauss reduction
or elimination method and the Crout reduction method (Hildebrand,
1956, pp. 428-434). These methods are general and can be used
with the Exact, Covariance and Autocorrelation formulations.
However, we note from the Covariance and Autocorrelation normal
equations (3-8) and (3-15) that the matrix of coefficients in each
case is a covariance matrix. The coefficients $\phi_{ik}$ in (3-8) form
a typical covariance matrix and the coefficients $R_{|i-k|}$ in (3-15)
form a special type of covariance matrix known as an
autocorrelation matrix. A covariance matrix is symmetric and in general
positive semidefinite, but in practice these covariance matrices
are usually positive definite. Therefore, (3-8) and (3-15) can
be solved more efficiently by the square-root method (Kunz, 1957,
pp. 222-225). This method also requires about half the storage
of the general methods. A similar method that does not employ
the square root operation has been reported by Wilkinson and
Reinsch (1971, pp. 9-30). Further reduction in storage and
computation time is possible in solving the Autocorrelation normal
equations because of their special form. Equation (3-15) can be
expanded in matrix form as:


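$$
\begin{bmatrix}
R_0 & R_1 & R_2 & \cdots & R_{p-1} \\
R_1 & R_0 & R_1 & \cdots & R_{p-2} \\
R_2 & R_1 & R_0 & \cdots & R_{p-3} \\
\vdots & \vdots & \vdots & & \vdots \\
R_{p-1} & R_{p-2} & R_{p-3} & \cdots & R_0
\end{bmatrix}
\begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ \vdots \\ a_p \end{bmatrix}
=
\begin{bmatrix} R_1 \\ R_2 \\ R_3 \\ \vdots \\ R_p \end{bmatrix}
\tag{3-17}
$$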


Note that the p x p autocorrelation matrix is symmetric and the
elements along any diagonal parallel to the principal diagonal
are identical. This type of matrix is also known as a Toeplitz
matrix (Grenander and Szego, 1958). Equation (3-17) can be solved
recursively by Robinson's method (Robinson, 1967b, pp. 274-279),
which is a reformulation of a method by Levinson (1947). A flow
chart for this method is given by Markel (1972). Robinson's
method assumes the column matrix on the right hand side of (3-17) to
be a general column matrix. By making use of the fact that this
column matrix comprises the same elements found in the
autocorrelation matrix, another method emerges which is twice as fast as
Robinson's. This faster method has been derived by several people
and was reported recently by Itakura and Saito (1971). A
derivation and a flow chart of the Fast Autocorrelation method can be
found in Appendix B of this report. This derivation employs the
theory of orthogonal polynomials in z, as developed by Grenander
and Szego (1958).
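
A recursion of this kind can be sketched as follows. The code is the classical Levinson-Durbin recursion written for the sign convention of (3-1); it is offered as an illustration of how the Toeplitz structure and the special right-hand side are exploited, and it is not claimed to reproduce the flow chart of Appendix B. The function name and the example autocorrelation values are assumptions made for the illustration.

```python
def levinson_durbin(R, p):
    """Solve sum_k a_k R[|i-k|] = R[i], 1 <= i <= p, for a_1..a_p.

    Returns (a, error) where a[k-1] = a_k and error = R[0] - sum_k a_k R[k].
    """
    a = [0.0] * (p + 1)              # a[0] is unused
    error = R[0]
    for i in range(1, p + 1):
        # Reflection coefficient for order i, from the order i-1 solution.
        k = (R[i] - sum(a[j] * R[i - j] for j in range(1, i))) / error
        previous = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = previous[j] - k * previous[i - j]
        error *= 1.0 - k * k         # prediction error shrinks at each order
    return a[1:], error


R = [1.0, 0.5, 0.2]                  # illustrative autocorrelation values
a, error = levinson_durbin(R, p=2)
residuals = [
    sum(a[k - 1] * R[abs(i - k)] for k in range(1, 3)) - R[i]
    for i in range(1, 3)
]
print(a, error, residuals)
```

Each order requires on the order of i multiplications, so the total work grows as p squared rather than p cubed, and only two length-p arrays are needed.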

Figure 3-1 shows a comparison between the Gauss elimination
method, the square-root method, and the Fast Autocorrelation
method, in terms of storage and computation. The computation is
represented by the total number of multiplications and divisions
needed for the solution. (Each square root in the square-root
method is represented by 3 computations.) The formulas for the
Gauss and square-root methods were taken from Ralston (1965, pp.
401, 410, 452, 462). The formulas for the Fast Autocorrelation
method were derived from the flow chart in Appendix B. For p=14,
the computation comparisons between the Fast Autocorrelation
method, the square-root method and the Gauss elimination method
are in the ratio of 1 : 3.2 : 5.3, while the storage requirements
are in the ratio of 1 : 3.8 : 7. These values must of course be
taken as approximate. It should be pointed out that the solution
of the normal equations for the predictor coefficients $a_k$ is
usually only a small fraction of the total amount of computation
that is involved in the analysis. For example, in order to
compute the autocorrelation coefficients from the signal, it takes
on the order of pN computations, where N is the number of samples
in the signal. For a 10 kHz sampled signal, N could be anywhere
between 100 and 300 depending on the application and the method
of linear prediction used. If N=150 in the Autocorrelation
method, then it takes 10 times as much computation to compute
the autocorrelation coefficients as to compute the predictor
coefficients using the Fast Autocorrelation method.

Method                         Storage          Computation
Gaussian Elimination           p^2              (p/6)(2p^2 + 6p - 2)
Square-Root Method             [not legible]    (p/6)(p^2 + 6p + 11)
Fast Autocorrelation Method    2p               p(p+1)

Fig. 3-1.  Approximate storage and computational requirements
for three methods of solving p simultaneous linear
equations. The column under computation shows the
total number of multiplications and divisions
required. A square root is represented by 3
computations.
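
The ratios quoted for p=14 can be checked numerically. In the sketch below the operation counts follow Fig. 3-1, with one loud assumption: the leading factor of the Gauss and square-root formulas is taken to be p/6, because that factor is not legible in the scanned figure. With that assumption the computed ratios agree with the ones quoted in the text. All function names are illustrative.

```python
def gauss_ops(p):
    # Assumed leading factor p/6 (not legible in the scanned figure).
    return p * (2 * p * p + 6 * p - 2) / 6.0


def square_root_ops(p):
    # Assumed leading factor p/6; each square root counted as 3 operations.
    return p * (p * p + 6 * p + 11) / 6.0


def fast_autocorrelation_ops(p):
    return p * (p + 1)


def autocorrelation_ops(p, n_samples):
    # Order-of-magnitude cost of computing the autocorrelation coefficients.
    return p * n_samples


p = 14
fast = fast_autocorrelation_ops(p)
ratios = (1.0, square_root_ops(p) / fast, gauss_ops(p) / fast)
lag_cost_ratio = autocorrelation_ops(p, 150) / fast
print([round(r, 1) for r in ratios], lag_cost_ratio)
```

The last figure illustrates the closing remark of the paragraph: for N=150 the autocorrelation computation costs about ten times as much as the recursion that solves the normal equations.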




3.3  Minimum Total-Squared Error

The predictor coefficients $a_k$ are determined such that the
total-squared error E in (3-4) is minimized. After computation
of the coefficients $a_k$ using one of the methods mentioned in
Section 3.2, one should be able to compute the minimum
total-squared error $E_p$ by substituting the computed coefficients
in (3-4). (Note that there is no error criterion associated
with the Exact method.) Thus:

$$
\begin{aligned}
E_p &= \sum_n \Big( s_n - \sum_{k=1}^{p} a_k s_{n-k} \Big)^2 \\
    &= \sum_n \Big[ s_n^2 - 2 s_n \sum_{k=1}^{p} a_k s_{n-k} + \sum_{k=1}^{p} \sum_{i=1}^{p} a_k a_i s_{n-k} s_{n-i} \Big] \\
    &= \sum_n s_n^2 - 2 \sum_{k=1}^{p} a_k \sum_n s_n s_{n-k} + \sum_{k=1}^{p} \sum_{i=1}^{p} a_k a_i \sum_n s_{n-k} s_{n-i} .
\end{aligned}
$$

Substituting (3-7), the condition for the minimization of E, and
collecting terms, we obtain the minimum total-squared error $E_p$:

$$ E_p = \sum_n s_n^2 - \sum_{k=1}^{p} a_k \sum_n s_n s_{n-k} . \tag{3-18} $$


In particular, for the Covariance method, n ranges from 0 to N-1.
Thus, substituting (3-9) in (3-18) we obtain the minimum
total-squared error in the Covariance method:

$$ E_p = \phi_{00} - \sum_{k=1}^{p} a_k \phi_{0k} . \quad \text{(Covariance Method)} \tag{3-19} $$

In the Autocorrelation method n ranges from $-\infty$ to $+\infty$.
Substituting (3-13) in (3-18) we have:

$$ E_p = R_0 - \sum_{k=1}^{p} a_k R_k . \quad \text{(Autocorrelation Method)} \tag{3-20} $$

We shall have the chance in Chapter V to discuss the
behavior of this minimum error in the Autocorrelation method as a
function of p and the autocorrelation function. In particular,
we shall be interested in the normalized error $V_p$ defined by:

$$ V_p = \frac{E_p}{R_0} = \frac{\text{energy in the predictor error samples}}{\text{energy in the speech signal}} \tag{3-21} $$

$$ V_p = 1 - \sum_{k=1}^{p} a_k r_k , \tag{3-22a} $$

where

$$ r_k = \frac{R_k}{R_0} , \quad \text{for all } k , \tag{3-22b} $$

and the samples $r_k$ will be known as the normalized autocorrelation
function. (Levinson (1947) uses the notation V, Markel (SCRL Mon.,
1971) uses n, and Atal and Hanauer (1971) use e for the normalized
error. We have chosen the letter V because of the possible
usefulness of the normalized error in the indication of voicing.)
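
A small numerical illustration of (3-20) through (3-22b) follows. The autocorrelation values and the function names are assumptions chosen for the example; the coefficients used are the exact solution of the order-2 Autocorrelation normal equations for those values, which the code verifies before using them.

```python
def minimum_error(R, a):
    """E_p of (3-20): R_0 - sum_k a_k R_k."""
    return R[0] - sum(a_k * R[k] for k, a_k in enumerate(a, start=1))


def normalized_error(R, a):
    """V_p of (3-22a): 1 - sum_k a_k r_k, with r_k = R_k / R_0."""
    return 1.0 - sum(a_k * R[k] / R[0] for k, a_k in enumerate(a, start=1))


R = [1.0, 0.5, 0.2]                  # illustrative autocorrelation values
a = [8.0 / 15.0, -1.0 / 15.0]        # solves the order-2 normal equations for R

# Confirm that a satisfies sum_k a_k R[|i-k|] = R[i] for i = 1, 2.
residuals = [
    sum(a[k - 1] * R[abs(i - k)] for k in (1, 2)) - R[i] for i in (1, 2)
]

V = normalized_error(R, a)
loud = [1000.0 * value for value in R]      # same signal, higher energy
V_loud = normalized_error(loud, a)
print(V, minimum_error(R, a) / R[0], V_loud)
```

The normalized error is unchanged when the autocorrelation function is scaled, which is what makes it a convenient energy-independent measure.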

Note that dividing (3-15) by $R_0$ and using (3-22b) we obtain:

$$ \sum_{k=1}^{p} a_k r_{|i-k|} = r_i , \quad 1 \le i \le p . \tag{3-23} $$

Equation (3-23) says that the predictor coefficients can also be
computed using the normalized autocorrelation samples $r_k$. From
(3-22b) and the fact that $r_k$ is an autocorrelation function we
have:

$$ r_0 = 1 \quad \text{and} \quad |r_k| \le 1 , \ \text{for all } k . \tag{3-24} $$

The signal total energy $R_0$ can vary widely for different signals,
which might cause round-off problems in trying to solve (3-15) in
a digital computer with only integer arithmetic capability.
For such cases it would be useful to normalize the
autocorrelation coefficients first by using (3-22b), and then solve for the
$a_k$'s using (3-23).

3.4  Stability of Linear Predictor

Given a frame of speech samples, the coefficients of the
linear predictor shown in Fig. 2-1 are determined as described
in Sections 1.2, 3.1, and 3.2. The all-pole transfer function
$\hat{S}(z)$ is then completely specified except for the multiplicative
constant A, which will be discussed in Section 3.5. One
important question now is the stability of the filter $\hat{S}(z)$. This
can be crucial if the recursive filter is to be used for speech
synthesis. We know from Fig. 2-1b and (2-6) that $\hat{S}(z)$ is
realizable. Therefore, the condition that $\hat{S}(z)$ must satisfy for
stability is that all the poles should lie inside the unit circle.
The poles of $\hat{S}(z)$ are simply the roots of the denominator
polynomial H(z), defined by (2-3), which depend completely on the
values of the coefficients $a_k$. Of the three linear prediction
formulations described in Section 1.2, only the Autocorrelation
method guarantees the stability of $\hat{S}(z)$, i.e. for any stable
signal, the poles of $\hat{S}(z)$ always lie inside the unit circle. [This
result is well known from inverse filter theory and from the theory
of orthogonal polynomials (see for example, Grenander and Szego,
1958, pp. 40-41).] The implication for using the predictor
coefficients in speech synthesis is clear: the coefficients $a_k$ can
be used directly for synthesis without having to check for the
stability of the predictive filter since that is guaranteed in
the Autocorrelation method.




In the Exact method and Covariance method the stability of
$\hat{S}(z)$ cannot, in general, be guaranteed. However, in practical
situations, the stability of $\hat{S}(z)$ can be improved in the
Covariance method by increasing the number of samples in the frame;
this is done by increasing N since p is normally fixed. This
cannot be done in the Exact method since the number of samples is
fixed at 2p samples. Atal and Hanauer (1971) describe a method
for correcting the positions of the poles which lie outside the
unit circle.
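
When the coefficients come from the Exact or Covariance method, a stability check is therefore needed before synthesis. The sketch below is one way to perform it without finding the roots explicitly: it runs the step-down (backward Levinson) recursion and declares the filter stable only if every reflection coefficient has magnitude below one. This test is a substituted technique, equivalent to checking that all poles lie inside the unit circle; it is not a procedure described in the report, and the function name is hypothetical. It does not attempt the pole correction of Atal and Hanauer.

```python
def is_stable(a):
    """True if all roots of 1 - sum_k a_k z^{-k} lie strictly inside
    the unit circle, for predictor coefficients a = [a_1, ..., a_p]."""
    current = list(a)
    while current:
        order = len(current)
        k = current[-1]                  # reflection coefficient of this order
        if abs(k) >= 1.0:
            return False
        denom = 1.0 - k * k
        # Step down to the predictor of order - 1.
        current = [
            (current[j] + k * current[order - 2 - j]) / denom
            for j in range(order - 1)
        ]
    return True


print(is_stable([1.2, -0.8]))   # complex pole pair with radius about 0.89
print(is_stable([1.9, -0.8]))   # real poles near 0.63 and 1.27
```

The second example shows why checking only the last coefficient is not enough: the product of the pole magnitudes is below one, yet one pole lies outside the unit circle.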

The above discussion assumes accurate computation of the
predictor coefficients $a_k$. For a 36-bit computer with
floating-point arithmetic, this has proved to be no problem. However,
for computers with half as many bits or less per computer word,
and with integer arithmetic capability only, round-off effects
may produce coefficients which result in an unstable $\hat{S}(z)$, even
with the Autocorrelation method (Markel and Gray, to be published).

3.5  Autocorrelation Analysis and Computation of Gain Factor A

There are several ways to determine A, the gain factor in
$\hat{S}(z)$, depending on the application. The criterion we shall use
in computing A is the following: The total energy in the impulse
response of $\hat{S}(z)$ must equal the total energy in the signal in the
frame of interest. This criterion is good for speech
recognition applications, but may have to be modified for vocoder
applications. We shall determine the total energy in the impulse
response of $\hat{S}(z)$ from the autocorrelation function $\hat{R}_i$
corresponding to the impulse response.

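The impulse response is easily specified from (2-6) by
setting $s_n = \hat{s}_n$ and $u_n = \delta_{n0}$, the input impulse:

$$ \hat{s}_n = \sum_{k=1}^{p} a_k \hat{s}_{n-k} + A \delta_{n0} , \tag{3-25} $$

where

$$ \delta_{nm} = \begin{cases} 1, & n = m , \\ 0, & \text{otherwise} . \end{cases} \tag{3-26} $$

Note from (3-25) that

$$ \hat{s}_n = 0 , \quad n < 0 , \tag{3-27} $$

$$ \hat{s}_0 = A , \tag{3-28} $$

and

$$ \hat{s}_n = \sum_{k=1}^{p} a_k \hat{s}_{n-k} , \quad n \ge 1 . \tag{3-29} $$

By definition, the autocorrelation function $\hat{R}_i$ is given by:

$$ \hat{R}_i = \sum_{n=-\infty}^{\infty} \hat{s}_n \hat{s}_{n+i} , \quad \text{for all } i . \tag{3-30} $$

We know that $\hat{R}_{-i} = \hat{R}_i$; therefore it is sufficient to compute
$\hat{R}_i$ for $i \ge 0$. From (3-27) and (3-30) we have:

$$ \hat{R}_i = \sum_{n=0}^{\infty} \hat{s}_n \hat{s}_{n+i} , \quad i \ge 0 . \tag{3-31} $$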


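Now, for $i \ge 1$, $n+i \ge 1$ in (3-31). Therefore, we can substitute n+i
for n in (3-29) and then substitute for the resulting $\hat{s}_{n+i}$ in
(3-31):

$$
\begin{aligned}
\hat{R}_i &= \sum_{n=0}^{\infty} \hat{s}_n \sum_{k=1}^{p} a_k \hat{s}_{n+i-k} , \quad i \ge 1 \\
          &= \sum_{k=1}^{p} a_k \sum_{n=0}^{\infty} \hat{s}_n \hat{s}_{n+i-k}
\end{aligned}
$$

$$ \hat{R}_i = \sum_{k=1}^{p} a_k \hat{R}_{|i-k|} , \quad 1 \le i < \infty . \tag{3-32} $$

Equation (3-32) is true for all i > 0. $\hat{R}_0$ is determined from (3-27)
through (3-30) as follows:

$$
\begin{aligned}
\hat{R}_0 &= \sum_{n=0}^{\infty} \hat{s}_n^2 \\
          &= \hat{s}_0^2 + \sum_{n=1}^{\infty} \hat{s}_n \sum_{k=1}^{p} a_k \hat{s}_{n-k} \\
          &= A^2 + \sum_{k=1}^{p} a_k \sum_{m=1-k}^{\infty} \hat{s}_m \hat{s}_{m+k} .
\end{aligned}
$$

Since $\hat{s}_m = 0$, m < 0, we have:

$$ \hat{R}_0 = A^2 + \sum_{k=1}^{p} a_k \sum_{m=0}^{\infty} \hat{s}_m \hat{s}_{m+k} $$

$$ \hat{R}_0 = A^2 + \sum_{k=1}^{p} a_k \hat{R}_k . \tag{3-33} $$

Equations (3-32) and (3-33) completely determine the
autocorrelation function of the impulse response of $\hat{S}(z)$.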
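
Relations (3-32) and (3-33) are easy to confirm numerically. The sketch below generates the impulse response from the recursion (3-25), approximates the infinite sums of (3-31) by a long finite sum, and evaluates the residuals of both relations. The coefficient values, the gain, the truncation length and the function names are assumptions made for the illustration.

```python
def impulse_response(a, gain, length):
    """First `length` samples of the recursion (3-25) driven by a unit impulse."""
    h = []
    for n in range(length):
        value = gain if n == 0 else 0.0
        for k, a_k in enumerate(a, start=1):
            if n - k >= 0:
                value += a_k * h[n - k]
        h.append(value)
    return h


def autocorrelation(x, lag):
    """Finite-sum version of (3-31); accurate once the tail of x is negligible."""
    return sum(x[n] * x[n + lag] for n in range(len(x) - lag))


a = [1.2, -0.8]      # illustrative stable predictor (pole radius about 0.89)
A = 2.0              # illustrative gain
h = impulse_response(a, A, length=600)
R_hat = [autocorrelation(h, i) for i in range(5)]

# Residuals of (3-32) for i = 1..4.
residual_32 = [
    R_hat[i] - sum(a_k * R_hat[abs(i - k)] for k, a_k in enumerate(a, start=1))
    for i in range(1, 5)
]
# Residual of (3-33).
residual_33 = R_hat[0] - (
    A * A + sum(a_k * R_hat[k] for k, a_k in enumerate(a, start=1))
)
print(R_hat[0], residual_32, residual_33)
```

Both residuals vanish to within rounding error, and the total energy of the impulse response scales with the square of the gain.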

Now, the total energy in the impulse response of $\hat{S}(z)$ is
given by $\hat{R}_0$. If we set $\hat{R}_0$ equal to the total energy of the
signal, which we will denote by $R_0$, then A can be determined from
(3-33) if $\hat{R}_k$, $1 \le k \le p$, are also known. Atal and Hanauer (1971,
p. 653) describe a recursive method for computing $\hat{R}_k$, $1 \le k \le p$, from
(3-32) with $\hat{R}_0$ normalized to 1. (We assume here that the
coefficients $a_k$ are known.) As we shall see in Section 3.51, there is
a much simpler method for computing $\hat{R}_k$ in the Autocorrelation
method. The only parameter that has not been specified
mathematically yet is $R_0$, the total energy in the signal. In the
Autocorrelation method this is done simply by summing the square of the
sample values for all time. The problem in the Exact and
Covariance methods is to specify the sample range whose total energy
is to be computed. A reasonable specification includes the trailing
p samples in the Exact method and the trailing N samples in the
Covariance method.




Note that since (3-32) is of the same form as (3-15), the
coefficients $a_k$ can be uniquely determined from $\hat{R}_i$, $0 \le i \le p$. Actually,
for a given A, there is a one-to-one relationship between the
impulse response of $\hat{S}(z)$ (which is completely determined by $a_k$) and
the corresponding autocorrelation function. We mentioned in
Section 3.4 that the stability of $\hat{S}(z)$ is guaranteed if the
coefficients $a_k$ are computed from (3-15). One might conclude that the
stability of $\hat{S}(z)$ is automatically guaranteed if the coefficients
are computed from (3-32). This is true under one condition: that
the autocorrelation coefficients be derived from a stable system.
In other words, let us assume that the coefficients were
computed using the Exact or the Covariance method, and that the
resulting $\hat{S}(z)$ was unstable. Then, one could compute the
autocorrelation function $\hat{R}_i$ as mentioned above. Solving for the
coefficients again using (3-32) will give values identical to the
original coefficients and $\hat{S}(z)$ remains unstable. The reason that
the stability of $\hat{S}(z)$ is guaranteed in the Autocorrelation method
is that the autocorrelation coefficients were derived from a
stable system, namely the windowed speech signal.

3.51  A  Special  Case:  The  Autocorrelation  Method

We already noted that (3-32) and (3-15) are of identical form,
except that in (3-15) the range of i is limited. Therefore, both
autocorrelation functions $\hat{R}_i$ and $R_i$ obey the same matrix equation
(3-17). From the properties of (3-17) we conclude that $\hat{R}_i$ and
$R_i$ are related by the following equation:

$$\hat{R}_i = c\,R_i, \qquad 0 \le i \le p, \tag{3-34}$$

where c is a constant to be determined.

In order to conserve energy between the impulse response of
$\hat{S}(z)$ and the actual signal, we must have $\hat{R}_0 = R_0$, as mentioned
above. From (3-34) we conclude that c must equal 1, and we have
the important result in the Autocorrelation method that:

$$\hat{R}_i = R_i, \qquad 0 \le i \le p. \tag{3-35}$$

This says that the first p coefficients (other than $R_0$) of the
autocorrelation function corresponding to the approximate spectrum,
as computed from $\hat{S}(z)$, are identical to the first p coefficients
of the autocorrelation function of the actual signal.

The rest of the coefficients $\hat{R}_i$ are determined by (3-32). The
problem of linear prediction using the Autocorrelation method
can be stated in a new way as follows: Find a transfer function
such that the first p values of its autocorrelation function are
equal to the first p values of the signal autocorrelation function,
and such that (3-32) applies.

Substituting (3-35) in (3-33) we have:

$$A^2 = R_0 - \sum_{k=1}^{p} a_k R_k. \tag{3-36}$$

The right-hand sides of (3-36) and (3-20) are identical.
Therefore,

$$A^2 = E_p = R_0 V_p = R_0 - \sum_{k=1}^{p} a_k R_k, \tag{3-37}$$

and $A^2$ is equal to the minimum total-squared error. From (3-37)
and (2-2) we have:

$$\hat{S}(z) = \frac{\sqrt{R_0 V_p}}{1 - \sum_{k=1}^{p} a_k z^{-k}}, \tag{3-38}$$

where $R_0$ is the total energy in the signal and $V_p$ is the normalized
error defined by (3-22).
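
The matching property (3-35) can be checked numerically. The following is a minimal sketch, not part of the original report: it assumes NumPy, a Hamming-windowed noise frame as a stand-in for speech, and hypothetical helper names (`autocorr`, `solve_normal_equations`, `model_impulse_response`). It solves the Autocorrelation normal equations, sets the gain from (3-37), and compares the autocorrelation of the model impulse response with that of the signal.

```python
import numpy as np

def autocorr(x, p):
    """R_k = sum_n x[n] x[n+k], k = 0..p, with x taken as zero outside its samples."""
    return np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(p + 1)])

def solve_normal_equations(R):
    """Solve sum_k a_k R_|i-k| = R_i, 1 <= i <= p, for a_1..a_p."""
    p = len(R) - 1
    idx = np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    return np.linalg.solve(R[idx], R[1:])

def model_impulse_response(a, A, n_samples):
    """Impulse response of the all-pole model A / (1 - sum_k a_k z^-k)."""
    p = len(a)
    h = np.zeros(n_samples)
    for n in range(n_samples):
        acc = A if n == 0 else 0.0
        for k in range(1, min(n, p) + 1):
            acc += a[k - 1] * h[n - k]
        h[n] = acc
    return h

rng = np.random.default_rng(0)
s = rng.standard_normal(256) * np.hamming(256)   # illustrative windowed frame
p = 8
R = autocorr(s, p)
a = solve_normal_equations(R)
A = np.sqrt(R[0] - np.dot(a, R[1:]))              # gain from (3-37)
h = model_impulse_response(a, A, 4000)            # long enough to have decayed
R_hat = autocorr(h, p)
print("max relative mismatch:", np.max(np.abs(R_hat - R)) / R[0])
```

The impulse response is truncated at 4000 samples, which is an assumption that the poles are far enough inside the unit circle for the tail to be negligible.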

The above findings will be very useful in discussing other
properties of the Autocorrelation method in Chapter V, where we
shall analyze the properties of the normalized error $V_p$ and the
behavior of different parameters as the number of predictor
coefficients $p \to \infty$.

CHAPTER  IV

SPECTRAL  ESTIMATION  AND  ANALYSIS-BY-SYNTHESIS

In Chapter III the Covariance and Autocorrelation methods
of linear prediction were derived from a time-domain formulation.
In this chapter we shall show that the same normal equations can
be derived from a frequency-domain formulation. It will become
clear that linear prediction can be considered equally validly
as either a time-domain or a frequency-domain type of analysis.

First, the Autocorrelation method is reinterpreted in terms
of an inverse filter formulation. This leads directly to linear
prediction analysis in the frequency domain. The Autocorrelation
method is rederived from the spectral domain by approximating
the signal short-time spectrum $P(\omega)$ by an all-pole power spectrum
$\hat{P}(\omega)$. An error criterion between the two spectra is defined and
minimized. The results are interpreted in terms of traditional
methods of spectral analysis-by-synthesis. The Autocorrelation
method is then reformulated in terms of a direct and an indirect
method by relating to the corresponding methods of estimation of
power spectra. An analogous reformulation of the Covariance
method is derived from a generalized method of analysis-by-synthesis
where the signal is assumed to be nonstationary and the
two-dimensional short-time power spectrum $Q(\omega,\omega')$ is to be
approximated by an all-pole two-dimensional spectrum $\hat{Q}(\omega,\omega')$.
A very brief introduction to nonstationary spectral analysis is
included.


4.1  Inverse  Filter  Formulation

The linear prediction error $e_n$ was defined by (3-2), and
is repeated here for convenience:

$$e_n = s_n - \sum_{k=1}^{p} a_k s_{n-k}. \tag{3-2}$$

Since the signal $s_n$ is defined for all time, then $e_n$ is also
defined for all time. Therefore, we can take the z-transform of
(3-2) by multiplying both sides of the equation by $z^{-n}$ and
summing over all n (see Appendix A for definition of z-transform).
The result is:

$$E(z) = S(z)\Bigl(1 - \sum_{k=1}^{p} a_k z^{-k}\Bigr) = S(z)\,H(z), \tag{4-1}$$

where E(z) and S(z) are the z-transforms of $e_n$ and $s_n$, respectively,
and $H(z) = 1 - \sum_{k=1}^{p} a_k z^{-k}$ was already defined in (2-3) as
the inverse filter.

From (4-1), the error signal $e_n$ can be interpreted as the output
of a filter H(z) whose input is $s_n$, as shown in Fig. 4-1.

Fig. 4-1. The error sequence $e_n$ as the output of an inverse
filter H(z).

Therefore, another way to view the error minimization problem in
Section 3.1 is to solve for the parameters of the inverse filter
H(z) which will minimize the energy $\sum_n e_n^2$ in the output error
signal, for a given value of p. This is what Markel calls the
inverse filter formulation (Markel, 1972).
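
The inverse filter view lends itself to a direct numerical check. The sketch below is illustrative only: it assumes NumPy, a windowed noise frame in place of speech, and hypothetical helper names. It passes the signal through H(z) and compares the output energy with $R_0 - \sum_k a_k R_k$.

```python
import numpy as np

def lpc_autocorrelation(s, p):
    """Return (R_0..R_p, a_1..a_p) from the Autocorrelation normal equations."""
    R = np.array([np.dot(s[:len(s) - k], s[k:]) for k in range(p + 1)])
    idx = np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    a = np.linalg.solve(R[idx], R[1:])
    return R, a

def inverse_filter(s, a):
    """e_n = s_n - sum_k a_k s_{n-k}, with s_n = 0 outside the given samples."""
    h = np.concatenate(([1.0], -np.asarray(a)))   # impulse response of H(z)
    return np.convolve(s, h)                      # N + p output samples

rng = np.random.default_rng(1)
s = rng.standard_normal(200) * np.hamming(200)   # illustrative windowed frame
R, a = lpc_autocorrelation(s, 10)
e = inverse_filter(s, a)
error_energy = np.sum(e ** 2)
E_p = R[0] - np.dot(a, R[1:])
print(error_energy, E_p)
```

Perturbing any coefficient away from the solution of the normal equations should increase the output energy, which is one way to confirm that the solution is a minimum.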

Equation (4-1) can be solved for S(z) to obtain:

$$S(z) = \frac{E(z)}{H(z)} = \frac{E(z)}{1 - \sum_{k=1}^{p} a_k z^{-k}}. \tag{4-2}$$

(4-2) is an exact equation. According to the speech production
model described in Section 2.1, if the signal $s_n$ is the vocal
tract response due to a single pitch pulse, then the transfer
function S(z) can be approximated by an all-pole filter $\hat{S}(z)$
given by (2-2) and shown below:

$$\hat{S}(z) = \frac{A}{H(z)} = \frac{A}{1 - \sum_{k=1}^{p} a_k z^{-k}}. \tag{2-2}$$

Comparing (2-2) and (4-2) we conclude that E(z) is approximated
by another function

$$\hat{E}(z) = A,$$

which corresponds to a time-domain approximation $\hat{e}_n$ given by:

$$\hat{e}_n = A\,\delta_{n0}, \tag{4-3}$$

where $\delta_{nm}$ is the Kronecker delta defined by (3-26). $\hat{e}_n$ is
just an impulse of magnitude A. Now, in order to conserve energy
between $\hat{e}_n$ and $e_n$ we must have

$$\sum_{n=-\infty}^{\infty} \hat{e}_n^{\,2} = \sum_{n=-\infty}^{\infty} e_n^2. \tag{4-4}$$

After the minimization of the total-squared error, the right-hand
side of (4-4) is equal to the minimum total-squared error $E_p$
given by (3-20). The left-hand side of (4-4) is determined easily
from (4-3), and we have:

$$A^2 = E_p = R_0 - \sum_{k=1}^{p} a_k R_k.$$

The result is identical to (3-37), which was derived by energy
conservation between the signal $s_n$ and the impulse response of
$\hat{S}(z)$.

The above analysis assumed that the vocal tract was excited
by a single pulse. The same results would be obtained if one
assumed a white noise source excitation.


4.2  Error  Minimization  in  the  Spectral  Domain

In this section we shall show that the Autocorrelation normal
equations (3-15) can also be derived completely in the frequency
domain. Before we proceed, we shall define the power spectrum
of a transfer function Y(z) as the magnitude squared of Y(z)
evaluated on the unit circle, i.e. $z = e^{j\omega T}$. Y(z) evaluated at
$z = e^{j\omega T}$ will be denoted by $Y(\omega)$, so that the power spectrum is
given by:

$$\text{Power Spectrum} = Y(\omega)\,\overline{Y(\omega)} = |Y(\omega)|^2, \tag{4-5}$$

where the over-bar denotes complex conjugate.

Let the power spectrum of $\hat{S}(z)$ be denoted by $\hat{P}(\omega)$, and of S(z) by
$P(\omega)$, then:

$$\hat{P}(\omega) = |\hat{S}(\omega)|^2 \tag{4-6a}$$

and

$$P(\omega) = |S(\omega)|^2. \tag{4-6b}$$

We shall call $\hat{P}(\omega)$ the linear prediction or approximate spectrum
and $P(\omega)$ the actual or signal spectrum. Methods for computing
$\hat{P}(\omega)$ and $P(\omega)$ are given in Appendix C.

Making use of Parseval's theorem (see Appendix A), the
total-squared error E can be represented by:

$$E = \frac{T}{2\pi}\int_{-\pi/T}^{\pi/T} P_e(\omega)\,d\omega, \tag{4-7}$$

where $P_e(\omega)$ is the error power spectrum.

From linear system theory, we have from Fig. 4-1:

$$P_e(\omega) = P(\omega)\,|H(\omega)|^2, \tag{4-8}$$

where $H(\omega)$ is equal to H(z) evaluated for $z = e^{j\omega T}$.

Substituting (4-8) in (4-7) we have:

$$E = \frac{T}{2\pi}\int_{-\pi/T}^{\pi/T} P(\omega)\,|H(\omega)|^2\,d\omega. \tag{4-9}$$
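Following the same procedure as in Section 3.1, E is minimized by
setting $\partial E/\partial a_i = 0$, $1 \le i \le p$:

$$\frac{\partial E}{\partial a_i}
= \frac{T}{2\pi}\int_{-\pi/T}^{\pi/T} P(\omega)\Bigl[-e^{-ji\omega T}\Bigl(1-\sum_{k=1}^{p} a_k e^{jk\omega T}\Bigr)
- e^{ji\omega T}\Bigl(1-\sum_{k=1}^{p} a_k e^{-jk\omega T}\Bigr)\Bigr]\,d\omega$$

$$\phantom{\frac{\partial E}{\partial a_i}}
= -\frac{T}{\pi}\int_{-\pi/T}^{\pi/T} P(\omega)\Bigl[\cos(i\omega T) - \sum_{k=1}^{p} a_k \cos\{(i-k)\omega T\}\Bigr]\,d\omega = 0.$$

Interchanging integration and summation we have:

$$\sum_{k=1}^{p} a_k \Bigl[\frac{T}{2\pi}\int_{-\pi/T}^{\pi/T} P(\omega)\cos\{(i-k)\omega T\}\,d\omega\Bigr]
= \frac{T}{2\pi}\int_{-\pi/T}^{\pi/T} P(\omega)\cos(i\omega T)\,d\omega, \qquad 1 \le i \le p. \tag{4-10}$$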


We know that the autocorrelation function R(kT) is defined as
the inverse Fourier transform of the power spectrum, i.e.

$$R_k = R(kT) = \frac{T}{2\pi}\int_{-\pi/T}^{\pi/T} P(\omega)\,e^{jk\omega T}\,d\omega \tag{4-11a}$$

$$\phantom{R_k = R(kT)} = \frac{T}{2\pi}\int_{-\pi/T}^{\pi/T} P(\omega)\cos(k\omega T)\,d\omega. \tag{4-11b}$$

(4-11b) follows from (4-11a) because the power spectrum is a
real and even function of frequency. Substituting (4-11b) in
(4-10) and noting that $R_{-k} = R_k$, we have:

$$\sum_{k=1}^{p} a_k R_{|i-k|} = R_i, \qquad 1 \le i \le p, \tag{4-12}$$

which are the same Autocorrelation normal equations as (3-15).

The minimum total-squared error can be obtained by using (4-10)
and (4-11) in (4-9). The answer can be shown to be equal to

$$E_p = R_0 - \sum_{k=1}^{p} a_k R_k, \tag{4-13}$$

which is identical to that given in (3-20) and (3-37).

The  above  derivation  shows  that,  in  the  Autocorrelation 
method, the predictor parameters $a_k$ can be determined if only
the  signal  power  spectrum  is  known.  In  fact  all  that  is  needed 
are  the  first  p  coefficients  of  the  autocorrelation  function, 
which  can  be  computed  either  from  the  time  signal  (Section  3.1) 
or  from  the  power  spectrum  as  was  shown  above.  The  latter  state¬ 
ment  will  be  the  basis  for  other  formulations  of  the  Autocorre¬ 
lation  method  which  are  based  on  the  idea  of  estimating  the  first 
p  values  of  the  autocorrelation  function  (see  Section  4.4). 
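
As an illustration of the two routes to the autocorrelation coefficients, the sketch below (an assumption-laden example using NumPy and hypothetical function names, not the report's own procedure) computes $R_0,\ldots,R_p$ once from the time signal and once as the inverse discrete Fourier transform of a sampled power spectrum. The signal is zero-padded to twice its length so that the discrete transform does not wrap lagged products around.

```python
import numpy as np

def autocorr_time(s, p):
    """R_k from lagged products of the (windowed) time signal."""
    return np.array([np.dot(s[:len(s) - k], s[k:]) for k in range(p + 1)])

def autocorr_from_spectrum(s, p):
    """R_k as the inverse DFT of the sampled power spectrum |S(w)|^2."""
    n_fft = 2 * len(s)                       # zero-padding avoids circular wrap-around
    P = np.abs(np.fft.fft(s, n_fft)) ** 2    # sampled power spectrum
    return np.fft.ifft(P).real[:p + 1]

rng = np.random.default_rng(2)
s = rng.standard_normal(128) * np.hamming(128)   # illustrative windowed frame
R_time = autocorr_time(s, 12)
R_spec = autocorr_from_spectrum(s, 12)
print(np.max(np.abs(R_time - R_spec)))
```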

4.3  The  Spectral  Envelope  and  Analysis-by-Synthesis

We shall now interpret the minimization of error in the
Autocorrelation method in terms of the estimation of the spectral
envelope and in terms of analysis-by-synthesis.

From (2-2), H(z) can be written as:

$$H(z) = \frac{A}{\hat{S}(z)}$$

and

$$H(\omega) = \frac{A}{\hat{S}(\omega)}. \tag{4-14}$$

Substituting (4-14) in (4-9) we obtain:

$$E = \frac{A^2 T}{2\pi}\int_{-\pi/T}^{\pi/T} \frac{P(\omega)}{|\hat{S}(\omega)|^2}\,d\omega. \tag{4-15}$$

$|\hat{S}(\omega)|^2$ is the approximate power spectrum $\hat{P}(\omega)$ as defined in
(4-6a), and (4-15) reduces to:

$$E = \frac{A^2 T}{2\pi}\int_{-\pi/T}^{\pi/T} \frac{P(\omega)}{\hat{P}(\omega)}\,d\omega. \tag{4-16}$$

Therefore, minimizing the total-squared error E is equivalent
to the minimization of the integrated ratio of the signal power
spectrum $P(\omega)$ to its approximation $\hat{P}(\omega)$. Another way to look at
this is that if one is interested in approximating a power spectrum
$P(\omega)$ by an all-pole spectrum $\hat{P}(\omega)$, then (4-16) is an error
measure that can be used in optimizing the approximation. We
already know that this error can be minimized analytically, resulting
in the Autocorrelation normal equations (4-12), which can be solved
for $a_k$, the parameters of the sought-for approximate spectrum $\hat{P}(\omega)$.
The question, then, is what are the properties of the error measure
in (4-16), and are these properties commensurate with our
stated goals? This is discussed below.

The model of speech production described in Chapter II
approximates the transfer function of the glottal flow, the vocal
tract and radiation by a single all-pole filter $\hat{S}(z)$ which is
excited by a combination of sequences of impulses and white noise.
Due to the nature of the excitation we conclude that $\hat{P}(\omega)$ attempts
to approximate the envelope of the signal power spectrum $P(\omega)$.

One important consideration in estimating the spectral envelope
is the determination of an optimum value for p, the number of
poles in the all-pole approximate spectrum $\hat{P}(\omega)$. This subject is
discussed in Section 5.6. However, assuming that somehow we
know this optimal value of p, there remains the question of whether
the error measure in (4-16) will result in a good estimate
of the spectral envelope. We note from (4-16) that spectral
values of $P(\omega)$ that are greater than the corresponding values
in $\hat{P}(\omega)$ will contribute to the total error in a significant manner,
while spectral values of $P(\omega)$ that are much smaller than
the corresponding values in $\hat{P}(\omega)$ will not affect the total error
significantly. This means that, after the minimization of error,
we expect a better fit of $\hat{P}(\omega)$ to $P(\omega)$ where $P(\omega)$ is greater
than $\hat{P}(\omega)$ than where $P(\omega)$ is smaller. For example, if $P(\omega)$ is
the power spectrum of a quasi-periodic signal (such as a sonorant),
then most of the energy in $P(\omega)$ will exist at the harmonics and
very little energy will reside between harmonics. The error measure
in (4-16) insures that the approximation of $\hat{P}(\omega)$ to $P(\omega)$ is
far superior at the harmonics, where the energy is greater, than
between the harmonics, where there is very little energy. Since
$\hat{P}(\omega)$ is expected to be a smooth spectrum (this is insured by
choosing an appropriate value for p), we conclude that minimization
of the error measure in (4-16) results in an approximate
spectrum $\hat{P}(\omega)$ that is a good estimate of the spectral envelope
of the signal power spectrum $P(\omega)$. It should be clear from the
above that the goodness of the error measure is much more
crucial for voiced sounds than for unvoiced sounds,
where the variations of the signal spectrum from the spectral
envelope are much less pronounced.
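
A discretized form of (4-16) gives a quick consistency check on the ratio measure. The sketch below is an illustrative assumption, not the report's procedure: it uses NumPy, a windowed noise frame, hypothetical function names, and an FFT length of at least N + p so that Parseval's theorem holds exactly for the finite error sequence. Because $E_p = A^2$ at the minimum, the average of $P(\omega)/\hat{P}(\omega)$ over the sampled frequencies should come out equal to 1.

```python
import numpy as np

def lpc_autocorrelation(s, p):
    """Return (R_0..R_p, a_1..a_p) from the Autocorrelation normal equations."""
    R = np.array([np.dot(s[:len(s) - k], s[k:]) for k in range(p + 1)])
    idx = np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    return R, np.linalg.solve(R[idx], R[1:])

def spectra(s, a, A, n_fft):
    """Sampled signal spectrum P and all-pole model spectrum P_hat = A^2 / |H|^2."""
    P = np.abs(np.fft.fft(s, n_fft)) ** 2
    H = np.fft.fft(np.concatenate(([1.0], -a)), n_fft)
    return P, A ** 2 / np.abs(H) ** 2

rng = np.random.default_rng(3)
s = rng.standard_normal(200) * np.hamming(200)   # illustrative windowed frame
p = 12
R, a = lpc_autocorrelation(s, p)
A = np.sqrt(R[0] - np.dot(a, R[1:]))
P, P_hat = spectra(s, a, A, n_fft=512)           # 512 >= N + p
mean_ratio = np.mean(P / P_hat)
print(mean_ratio)
```

Individual ratios can still be far from 1; it is only their average over frequency that is pinned, which is consistent with the fit being tighter where $P(\omega)$ exceeds $\hat{P}(\omega)$.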

Another important property of this estimation procedure is
that, because the contributions to the total error are determined
by the ratio of the two spectra, the matching process should perform
uniformly over the frequency range of interest, irrespective
of the shaping of the speech spectral envelope. This property is
reminiscent of the analysis-by-synthesis method of spectral
reduction developed at M.I.T. (Bell et al., 1961), which was used
by Paul et al. (1964) for the automatic reduction of vowel spectra,
and by Fujimura (1962) for the analysis of nasal consonants.
A recent improvement in convergence strategy was introduced by
Olive (1971) using a Newton-Raphson technique. Also, a
pitch-synchronous analysis-by-synthesis was developed by Mathews et al.
in 1961. The general idea behind the reduction of spectra using
analysis-by-synthesis is that one has a spectral model consisting
of poles and zeros, and the problem is to vary the positions of
these poles and zeros such that some error criterion between the
model spectrum and the signal spectrum is minimized. The error
measure that was normally used is given (in our notation) by:

$$E' = \int W(\omega)\Bigl[\log\frac{P(\omega)}{\hat{P}(\omega)}\Bigr]^2 d\omega, \tag{4-17}$$

where $W(\omega)$ is a weighting function, $\hat{P}(\omega)$ is the model spectrum,
and the integration is over the frequency range of interest. In
many cases the weighting function $W(\omega)$ was set equal to 1, and
the integration was always approximated by a summation over
discrete frequencies. The positions of poles and zeros of $\hat{P}(\omega)$
were varied such that the error E' was minimized.

It is clear that the Autocorrelation method of linear prediction
can be viewed as a method of analysis-by-synthesis where
the model spectrum $\hat{P}(\omega)$ consists of poles only and the error
measure is given by (4-16). The error measures in (4-16) and (4-17)
are similar in that the contributions to the total error are
proportional to the ratio of the two spectra. We have already
mentioned that this fact makes the matching process perform
uniformly over the frequency range of interest (assuming $W(\omega)$ in
(4-17) to be constant). However, the error measure E in linear
prediction has two advantages over E': (1) The minimization of
E in (4-16) can be done analytically and the resulting $\hat{P}(\omega)$ is
computed simply by solving a set of simultaneous linear equations,
while the minimization of E' has to be done iteratively and also
approximately, in that a summation is used instead of an integration.
(2) E is a superior error measure to E' if a spectral envelope
is desired. This is clear if one notes from (4-17) that
contributions to the total error E' are made equally whether
$P(\omega) > \hat{P}(\omega)$ or $P(\omega) < \hat{P}(\omega)$, which means that energy at the harmonics
(in voiced sounds) and the lack of energy between harmonics
contribute equally to the total error. This, of course, will not
lead to a good spectral envelope. But then, traditional
analysis-by-synthesis methods have generally used already smoothed
spectra, in which case it is probably of little consequence which
error measure is used. The elegance of the linear prediction
method is that it performs the smoothing (for a well-chosen p)
as well as the analysis-by-synthesis type of computation all at
once by simply solving a set of simultaneous linear equations.

The price that one has to pay is that the approximate spectrum
$\hat{P}(\omega)$ can have only poles.

By virtue of the above properties of linear prediction, it
follows that any smoothing of the signal spectrum before the
application of linear prediction is not only a waste of time, but
may also introduce errors in the estimation of the predictor
parameters. For example, preprocessing the speech signal by
homomorphic analysis (Weinstein and Oppenheim, 1971) is unnecessary
if one is interested in using linear prediction; better results
would be obtained by using linear prediction on the original
signal.

Figure 4-2 shows an example of the Autocorrelation method
of analysis performed on a 25 msec portion of the vowel [æ] in
the word "potassium". A Hamming window was used on the signal
and the predictor had 14 poles. $\hat{P}(\omega)$ seems to be a good estimate
of the spectral envelope of the signal power spectrum $P(\omega)$.
(See Appendix C for methods of computing $P(\omega)$ and $\hat{P}(\omega)$.)

4.4  Reformulation  of  the  Autocorrelation  Method 

We have shown above that the Autocorrelation method of linear
prediction can be viewed as a process of spectral matching
or approximation, where the envelope of the signal power spectrum
$P(\omega)$ is approximated by an all-pole power spectrum $\hat{P}(\omega)$ given
by (4-6a), and the error measure to be minimized is given
by (4-16). So far in this report we have assumed $P(\omega)$ to be a
short-time spectrum obtained by taking the power spectrum of a
windowed signal. However, there is nothing in this chapter that
restricts $P(\omega)$ to be defined in that particular manner. In
general, there are two basic methods for the estimation of the
power spectrum from a knowledge of a finite portion of a stationary
signal (see Blackman and Tukey, 1958):


1.  Direct Method - The power spectrum is estimated by:

$$P(\omega) = \Bigl|\sum_{n=0}^{N-1} w(nT)\,s(nT)\,e^{-jn\omega T}\Bigr|^2, \tag{4-18}$$

where s(nT) is the original signal whose power spectrum is desired,
and w(nT) is a window function that is defined to be zero for
$n < 0$ and $n \ge N$. (A discussion of window functions is given in
Section 6.2.) The spectrum defined by (4-18) is also known as the
short-time spectrum, and it is the method we have used thus far
to estimate the power spectrum of a short portion of the signal.


2.  Indirect Method - The estimated power spectrum is computed
as the Fourier series of a windowed apparent autocorrelation
function:

$$P(\omega) = \sum_{k=-M}^{M} D(kT)\,\tilde{R}(kT)\,e^{-jk\omega T}, \tag{4-19}$$

where D(kT) is an even window defined to be zero for $|k| > M$, and
$\tilde{R}(kT)$ is the apparent autocorrelation function, which is computed
from the signal. The word "apparent" is used to indicate
that $\tilde{R}(kT)$ is not a true autocorrelation function since it is
defined over a finite portion of the signal. We shall give two
methods for the computation of $\tilde{R}(kT)$, yielding functions which
will be labelled $\tilde{R}_k^{(1)}$ and $\tilde{R}_k^{(2)}$:

$$\text{(a)}\quad \tilde{R}_k^{(1)} = \frac{N}{N-|k|}\sum_{n=0}^{N-1-|k|} s_n\,s_{n+|k|}, \qquad |k| \le M. \tag{4-20}$$

$$\text{(b)}\quad \tilde{R}_k^{(2)} = \sum_{n=0}^{N-1} s_n\,s_{n+|k|}, \qquad |k| \le M. \tag{4-21}$$

In  (4-20)  the  signal  s  (nT)  is  assumed  to  be  known  for  N  consecu¬ 
tive  samples  while  in  (4-21)  s (nT)  is  assumed  to  be  known  for 
N+M  samples.  The  signal  is  undefined  outside  these  ranges.  Note 
that  we  must  have  M<N,  and  for  a  stable  spectral  estimate  of  a 
noisy  signal,  M  is  usually  taken  to  be  a  small  fraction  of  N.  See 
Blackman  and  Tukey  (1958)  for  a  thorough  analysis  of  this  sub¬ 
ject. 
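
The two apparent autocorrelation functions can be sketched as follows. This is a minimal illustration under stated assumptions: NumPy, hypothetical function names, and the scale factor N/(N-|k|) in the first form taken from (4-20) as read here. The first form uses only N samples and compensates for the shrinking number of lagged products; the second uses N + M samples so that every lag sums exactly N products.

```python
import numpy as np

def apparent_autocorr_1(s, N, M):
    """Form (4-20): N samples, N - k products at lag k, rescaled by N / (N - k)."""
    s = np.asarray(s[:N], dtype=float)
    return np.array([N / (N - k) * np.dot(s[:N - k], s[k:N]) for k in range(M + 1)])

def apparent_autocorr_2(s, N, M):
    """Form (4-21): N + M samples, exactly N products at every lag k <= M."""
    s = np.asarray(s[:N + M], dtype=float)
    return np.array([np.dot(s[:N], s[k:k + N]) for k in range(M + 1)])

# For a constant signal both forms give the same value, N, at every lag.
x = np.ones(50)
print(apparent_autocorr_1(x, 40, 5))
print(apparent_autocorr_2(x, 40, 5))
```

Neither function applies a window to the signal or to the lags; the lag window D(kT) of (4-19) would be applied afterwards if desired.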

Sometimes a single estimate of the power spectrum as described
above may not be stable enough, i.e. the variability of
the estimate with respect to the "true" spectrum is large. The
stability can be improved (with a corresponding decrease in
frequency resolution) by averaging over several estimates of the
power spectrum taken over several (possibly overlapping) portions
of the signal. The averaging can alternatively be performed on
the autocorrelation function. One must be careful, however,
that the basic assumption of stationarity still holds for the
total signal span whose power spectrum is being estimated.

In speech research, the direct method of spectral analysis
has been used almost exclusively. The method is computationally
efficient and has proved to be quite adequate for many speech
applications. Using the indirect method for computing the power
spectrum is relatively inefficient, and may not be cost-effective
for many applications.

Having computed the estimated signal power spectrum $P(\omega)$
by one of the methods described above, we can compute the
parameters of the approximate power spectrum $\hat{P}(\omega)$ from the
Autocorrelation normal equations (4-12), where the autocorrelation
coefficients $R_k$ are computed from $P(\omega)$ by using (4-11). But if the
coefficients $R_k$ can be computed directly from the time signal
there is no need to estimate $P(\omega)$ in the first place. Indeed,
using the direct method, we have already shown how to compute
$R_k$ from the windowed signal (see (3-16)). In the indirect method,
from (4-19), the coefficients $R_k$ are equal to:

$$R_k = D_k\,\tilde{R}_k, \qquad \text{(Indirect Method)} \tag{4-22}$$

where $\tilde{R}_k$ is either equal to $\tilde{R}_k^{(1)}$ in (4-20) or to $\tilde{R}_k^{(2)}$ in (4-21).

The introduction of an autocorrelation window may produce
some distortion in estimating $R_k$. One method of avoiding the
use of such a window is to let $R_k$ be the average of several
values of $\tilde{R}_k$ computed from overlapping portions of the signal.

If we replace s(nT) by s(nT+iT) in (4-20) and (4-21), we can say
that $\tilde{R}_k^{(1)}$ and $\tilde{R}_k^{(2)}$ are functions of time t = iT, and they can
be denoted by $\tilde{R}_k^{(1)}(iT)$ and $\tilde{R}_k^{(2)}(iT)$. Similarly, $\tilde{R}_k$ at time t = iT
will be denoted by $\tilde{R}_k(iT)$. The index i can be varied and the
resulting values of the apparent autocorrelation can be averaged,
yielding an estimated $R_k$. This can be written as:

$$R_k = \frac{1}{M}\sum_{i=0}^{M-1} \tilde{R}_k(iT). \tag{4-23}$$

Alternatively, the number of values averaged could be made to
depend on the index k of $R_k$. Thus,

$$R_k = \frac{1}{M-|k|}\sum_{i=0}^{M-1-|k|} \tilde{R}_k(iT), \qquad M > k,\quad 0 \le k \le p. \tag{4-24}$$

In (4-24) more values are used in computing $R_k$ for low values of
k than for large values of k. This is not unreasonable since
the low-order autocorrelation coefficients are more important
in determining the general shape of the spectrum, and therefore
their values should be more "accurate" or stable.

The definitions for $\tilde{R}_k$ given by (4-20) and (4-21) are only
two of several possible definitions. For example, two other
similar definitions are obtained by inversion of the time axis.
This is done by substituting the index $(n-|k|)$ for $(n+|k|)$ in
(4-20) and (4-21). Also, $\tilde{R}_k(iT)$ would be obtained by replacing
s(nT) by s(nT-iT) in (4-20) and (4-21). In that case, $\tilde{R}_k(iT)$ in
(4-21) becomes equal to

$$\tilde{R}_k^{(2)}(iT) = \phi_{i,i+k}, \qquad 0 \le k \le p,$$

where $\phi_{ik}$ are the covariance coefficients defined in (3-9). In
fact, if we substitute $\phi_{0k}$ for $R_k$ in the equation for the minimum
total-squared error in (3-20), then (3-19) and (3-20) become
identical. Also, (4-24) for M = p+1 reduces to:

$$R_k = \frac{1}{p+1-|k|}\sum_{i=0}^{p-|k|}\sum_{n=0}^{N-1} s_{n-i}\,s_{n-i-|k|}
= \frac{1}{p+1-|k|}\sum_{i=0}^{p-|k|} \phi_{i,i+k}, \qquad 0 \le k \le p, \tag{4-25}$$

which is the average of the covariance coefficients along each of
the diagonals in the covariance matrix $\phi_{ik}$ (including the vector
$\phi_{0k}$). One way to look at the operation in (4-25) is that it is
averaging out the nonstationarity inherent in the covariance
matrix (see Section 4.6), resulting in a stationary autocorrelation
matrix. As we shall see below, the Covariance method and
the indirect formulation of the Autocorrelation method share the
property that the stability of the linear predictor cannot be
guaranteed.

Henceforth, we shall talk about the direct or indirect
Autocorrelation method as referring to whether the coefficients
$R_k$ are computed from a windowed signal or from an apparent
autocorrelation function $\tilde{R}_k$, respectively. Note that although the
indirect method may be inefficient for computation of the power
spectrum, the same is not true for the computation of (p+1)
values of $R_k$.
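
The diagonal averaging of (4-25) can be sketched in a few lines. The example below is illustrative only: it assumes NumPy, hypothetical function names, and an input array that holds the samples $s_{-p},\ldots,s_{N-1}$ in order, so that $\phi_{ik} = \sum_{n=0}^{N-1} s_{n-i}\,s_{n-k}$ can be formed for $0 \le i,k \le p$.

```python
import numpy as np

def covariance_matrix(x, p):
    """phi[i, k] = sum_{n=0}^{N-1} s_{n-i} s_{n-k}; x holds s_{-p}..s_{N-1}."""
    x = np.asarray(x, dtype=float)
    N = len(x) - p
    # Row i is the signal delayed by i samples, restricted to n = 0..N-1.
    rows = np.array([x[p - i:p - i + N] for i in range(p + 1)])
    return rows @ rows.T

def average_diagonals(phi):
    """R_k = mean of phi[i, i+k] over i = 0..p-k, for k = 0..p."""
    p = phi.shape[0] - 1
    return np.array([np.mean(np.diagonal(phi, offset=k)) for k in range(p + 1)])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # s_{-1}, s_0, ..., s_3 with p = 1
phi = covariance_matrix(x, 1)
print(phi)                     # [[54. 40.] [40. 30.]]
print(average_diagonals(phi))  # [42. 40.]
```

The averaged values could then be used as $R_k$ in the Autocorrelation normal equations; as the text notes, stability is not guaranteed for coefficients obtained this way.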

4.41  Stability  of  Linear  Predictor 

In Section 3.4 we stated that of the different formulations
of linear prediction, only the Autocorrelation method guarantees
the stability of the linear predictor, i.e. all the poles of $\hat{S}(z)$
are inside the unit circle. This statement must be amended now
to read: only the direct Autocorrelation method guarantees the
stability of the linear predictor. The reason for this restriction
is that the coefficients $R_k$ are guaranteed to be those of
an autocorrelation function only in the direct method. In the
indirect method, the coefficients $R_k$ are only estimates of some
autocorrelation function, as can be seen from (4-20) to (4-24).
These estimates may or may not form part of an autocorrelation
function. In order for the coefficients $R_k$ to be those of an
autocorrelation function they must form a set that is
positive-definite (Papoulis, 1965, p. 349). More formally, given an
arbitrary set of constants $u_k$, $0 \le k \le p$, the coefficients $R_k$,
$0 \le |k| \le p$, form a positive-definite set if and only if the following
condition holds (Papoulis, 1965, p. 349; Grenander and Szegö,
1958, pp. 17-19):

$$T_i = \sum_{n=0}^{i}\sum_{m=0}^{i} R_{n-m}\,u_n\,\overline{u}_m \ge 0, \qquad 0 \le i \le p, \tag{4-26}$$

where $T_i$, $0 \le i \le p$, are known as Toeplitz forms, and the over-bar
denotes complex conjugate.

In particular, (4-26) should be true for i = p, and for the
constants $u_k$ equal to the impulse response of the inverse filter
$H(z) = 1 - \sum_{k=1}^{p} a_k z^{-k}$. Let

$$u_k = \begin{cases} 1, & k = 0, \\ -a_k, & 1 \le k \le p. \end{cases} \tag{4-27}$$

Substituting (4-27) in (4-26):

$$T_p = R_0 - \sum_{m=1}^{p} R_m a_m - \sum_{n=1}^{p} a_n\Bigl[R_n - \sum_{m=1}^{p} R_{n-m}\,a_m\Bigr].$$

But the terms in square brackets are zero, due to the Autocorrelation
normal equations (4-12). Hence,

$$T_p = R_0 - \sum_{k=1}^{p} a_k R_k = E_p \ge 0, \tag{4-28}$$

and the Toeplitz form is equal to the minimum total-squared
error $E_p$, which must be greater than or equal to zero. Although (4-28)
is a special case of (4-26), it can be shown that (4-28) is a
necessary and sufficient condition for the set of coefficients
$R_k$ to be positive-definite, and hence result in a stable $\hat{S}(z)$
(see Appendix B). Therefore, in order to test for the stability
of the linear predictor, given a set of coefficients $R_k$: Compute
the predictor parameters $a_k$ from (3-17) and check for the
condition (4-28).
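
The test just described can be sketched as follows. This is a minimal illustration assuming NumPy and a hypothetical function name; in addition to $E_p$ from (4-28), it reports the largest pole magnitude of the all-pole model, which is not part of the procedure in the text but serves as a direct cross-check on stability.

```python
import numpy as np

def stability_test(R):
    """Return (E_p, largest pole magnitude) for coefficients R_0..R_p."""
    R = np.asarray(R, dtype=float)
    p = len(R) - 1
    idx = np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    a = np.linalg.solve(R[idx], R[1:])            # normal equations
    E_p = R[0] - np.dot(a, R[1:])                 # Toeplitz form T_p of (4-28)
    # Poles of A / (1 - sum a_k z^-k) are the roots of z^p - a_1 z^(p-1) - ... - a_p.
    poles = np.roots(np.concatenate(([1.0], -a)))
    return E_p, np.max(np.abs(poles))

print(stability_test([1.0, 0.5, 0.25]))   # a valid autocorrelation: E_p > 0, poles inside
print(stability_test([1.0, 1.2]))         # not positive-definite: E_p < 0, pole outside
```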

Another method to check for the positive-definiteness of the
coefficients $R_k$ is to make sure that the corresponding power
spectrum is nonnegative for all frequencies (Papoulis, 1965, p. 349).
But in order to do that, $R_k$ must be defined for all $k$. Such a
definition can be arbitrary for $|k| > p$. A convenient way of
extending $R_k$ is to make it periodic with period $2p$, i.e.

$$R_{k+2p} = R_k. \tag{4-29}$$

We can now apply the discrete Fourier transform (Gold and Rader,
1969, p. 162) to $R_k$ and obtain the discrete power spectrum
$P(n\omega_0)$:

$$P(n\omega_0) = \sum_{k=0}^{2p-1} R_k\, e^{-jkn\omega_0 T}, \tag{4-30}$$

where

$$\omega_0 = \frac{2\pi}{2pT}.$$

Since $R_k$ is discrete, real, even and periodic in $2p$,
$P(n\omega_0)$ is also discrete, real, even and periodic in $2p$.
Therefore, it is only necessary to compute $p+1$ values of
$P(n\omega_0)$, e.g. $0 \le n \le p$. If these values of
$P(n\omega_0)$ are all greater than or equal to zero, we conclude
that the set of coefficients $R_k$ is positive-definite and that
$\hat{S}(z)$ will be stable.
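
A minimal sketch of this second test follows, assuming $p \ge 1$ and
real coefficients. The function name is hypothetical. Because the
extended sequence is real and even, the imaginary parts of (4-30)
cancel and only the cosine terms are summed.

```python
import math


def discrete_power_spectrum(R):
    """Evaluate (4-30) for n = 0..p from coefficients R[0..p].

    R is extended to one full period of length 2p using the even,
    periodic extension R_{2p-k} = R_{-k} = R_k of (4-29).
    """
    p = len(R) - 1
    extended = [R[k] if k <= p else R[2 * p - k] for k in range(2 * p)]
    # omega_0 * T = 2*pi / (2p) = pi / p.
    return [
        sum(extended[k] * math.cos(math.pi * k * n / p) for k in range(2 * p))
        for n in range(p + 1)
    ]


def spectrum_is_nonnegative(R):
    """True if all p+1 computed spectral values are >= 0."""
    return all(value >= 0.0 for value in discrete_power_spectrum(R))
```

With the coefficients `[1.0, 0.5, 0.25]` every spectral value is
positive, while `[1.0, 0.9, 0.2]` gives a negative value at $n = p$.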

Suppose now that we have used one of the above methods (or any other
method) to check for the stability of $\hat{S}(z)$ and found it to be
unstable. The problem is what to do about the coefficients $R_k$ to
improve the stability of $\hat{S}(z)$. One method is to use a window
as shown in (4-22). The narrower the effective window width, the more
stable $\hat{S}(z)$ is likely to be. A superior and highly
recommended method is to take the average of $R_k$ for several
overlapping portions of the signal, as shown in (4-23) and (4-24).
Increasing the value of M in those equations increases

the  stability  of  S(z).  A  value  of  M<¥p  is  usually  sufficient. 

Note that the methods that have been suggested for improving the
stability of the linear predictor have the side effect of decreasing
the frequency resolution in the corresponding power spectrum. Indeed,
in the direct Autocorrelation method, the stability of the linear
predictor is guaranteed by multiplying the speech signal s(nT) by a
finite window: a process that results in loss of resolution in the
signal power spectrum. However, for most applications this loss of
resolution is not critical.

4.5 Nonstationary Spectral Analysis

So far in this chapter we have discussed the spectral analysis of
speech by means of the Autocorrelation method of linear prediction.
The main assumption underlying the whole discussion was that the
predictor coefficients $a_k$, $1 \le k \le p$, are computed from a
portion of the signal that can be considered as stationary. In the
direct method, this stationarity was enforced by windowing the speech
signal and considering the resulting infinite signal, which has a
well-defined, time-independent power spectrum and autocorrelation. In
the indirect method, stationarity was enforced by assuming first that
(3-17) holds, and then proceeding to estimate the autocorrelation
coefficients. The averaging operations in (4-23) and (4-24) are only
valid under the assumption of stationarity.

As we shall see in this section, the Covariance method assumes that
the portion of the signal from which the predictor parameters are
computed is nonstationary. It should be made clear that we are not
discussing the stationarity of the running speech signal as such, but
rather the stationarity of a single frame from which we wish to
compute the predictor parameters. Both the Covariance and the
Autocorrelation methods assume that the running speech signal is
nonstationary. This is evident by the fact that the predictor
parameters change from one frame to the next, as was assumed in the
model for speech production in Chapter II. However, within a single
frame, the Autocorrelation method assumes that the signal is
stationary while the Covariance method assumes that the signal is
nonstationary.

Just as in Section 4.2 we derived the Autocorrelation normal
equations in the frequency domain, we shall do the same to derive the
Covariance normal equations. The only difference is that here we
shall assume the signal to be nonstationary, in which case the power
spectrum is a function of time. However, before we do the derivation
we shall give some background information on spectral analysis of
nonstationary signals. For references on the subject see, for
example, Papoulis (1965, Ch. 12) and Bendat and Piersol (1966,
Ch. 9).

The autocorrelation $R(t,t')$ of a nonstationary process is a
function of two time variables $t$ and $t'$. A stationary process is
then a special case where the autocorrelation becomes a function of
only the time lag $t'-t$, i.e. $R(t'-t)$. If we let

$$\tau = t' - t \tag{4-31}$$

be the time lag, then $R(t'-t) = R(\tau)$ for a stationary process,
and $R(t,t') = R(t,t+\tau)$ for a nonstationary process. Here we
shall assume that $t$, $t'$ and $\tau$ take on discrete values only.
For example, if we let $\tau = kT$, then $R(kT)$ would be an
autocorrelation function which we have seen repeatedly in this
chapter.

The power spectrum of a nonstationary discrete process is defined as
the Fourier series transform of the autocorrelation $R(t,t+\tau)$:

$$P(\omega,t) = \sum_{\tau=-\infty}^{\infty} R(t,t+\tau)\, e^{-j\omega\tau}. \tag{4-32}$$

Note that the spectrum $P(\omega,t)$ is a function of time $t$. For a
stationary process the autocorrelation is a function of $\tau$ only,
and from (4-32) we see that the power spectrum becomes $P(\omega)$,
which is time-independent. In speech analysis, $P(\omega,t)$ can be
viewed as the running short-time spectrum (e.g. such as a
spectrograph might produce). However, what is important in the
Covariance method is that we wish to consider the spectrum
$P(\omega,t)$ to be changing in time within a single frame of the
signal, and that we wish to represent this change in some manner.
This can be done by taking the Fourier transform of $P(\omega,t)$
with respect to time $t$. The result is a frequency correlation
function which is the generalized (nonstationary) spectrum. It is
defined by:

$$\Gamma(\omega,\Omega) = \sum_{t=-\infty}^{\infty} P(\omega,t)\, e^{-j\Omega t}. \tag{4-33}$$

$\Gamma(\omega,\Omega)$ is also known as a double frequency spectrum.
Since it is defined as a two-dimensional transform, we shall call
$\Gamma(\omega,\Omega)$ the 2D-spectrum. (The summation in (4-33) is
for all time. However, we are only interested in $t$ varying over a
small range, namely that corresponding to the frame of interest.
Therefore, just as we are interested in a short-time spectrum
$P(\omega,t)$ we are also interested in a short-time 2D-spectrum
$\Gamma(\omega,\Omega)$. That is, the short-time analysis is to be
performed in two dimensions.)

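From (4-32) and (4-33) we have:

$$\Gamma(\omega,\Omega) = \sum_{t=-\infty}^{\infty} \sum_{\tau=-\infty}^{\infty} R(t,t+\tau)\, e^{-j(\omega\tau+\Omega t)}. \tag{4-34}$$

It can be shown that $R(t,t+\tau)$ can be computed from
$\Gamma(\omega,\Omega)$ by a two-dimensional inverse Fourier
transform:

$$R(t,t+\tau) = \left(\frac{T}{2\pi}\right)^2 \int_{-\pi/T}^{\pi/T} \int_{-\pi/T}^{\pi/T} \Gamma(\omega,\Omega)\, e^{j(\omega\tau+\Omega t)}\, d\omega\, d\Omega, \tag{4-35}$$

where $T$ is the sampling interval and the integration in each
variable is over one period, equal to $\omega_s = 2\pi/T$. Although
$P(\omega,t)$ is real and even with respect to $\omega$,
$\Gamma(\omega,\Omega)$ is in general complex. It has the properties:

$$\Gamma(\omega+n\omega_s,\, \Omega+m\omega_s) = \Gamma(\omega,\Omega), \quad -\infty < n,m < \infty, \tag{4-36}$$

$$\Gamma(-\omega,\Omega) = \Gamma(\omega,\Omega), \tag{4-37}$$

$$\Gamma(\omega,-\Omega) = \overline{\Gamma}(\omega,\Omega), \tag{4-38}$$

where the over-bar denotes complex conjugate. Therefore,
$\Gamma(\omega,\Omega)$ is even with respect to $\omega$ and
hermitian with respect to $\Omega$.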

For a stationary process we know that the spectrum is
time-independent, i.e. $P(\omega,t) = P(\omega)$. From (4-33) we have

$$\Gamma(\omega,\Omega) = P(\omega) \sum_{t=-\infty}^{\infty} e^{-j\Omega t} = 2\pi P(\omega) \sum_{n=-\infty}^{\infty} u_0(\Omega - n\omega_s), \tag{4-39}$$

where $u_0(x)$ is the impulse function defined by:

$$u_0(x) = 0, \quad x \ne 0, \qquad \text{and} \qquad \int_{-\infty}^{\infty} u_0(x)\, dx = 1. \tag{4-40}$$

Note that the impulse function $u_0(x)$ is different from the unit
impulse (or unit sample) defined in (3-26). Equation (4-39) says that
for a stationary discrete process, the 2D-spectrum consists of a set
of periodic "line masses" with density $2\pi P(\omega)$, where
$P(\omega)$ is the power spectrum of the process. In the
$\omega,\Omega$ plane these line masses are parallel to the
$\Omega$-axis.

In order to make the analysis below more convenient we shall redefine
the 2D-spectrum so that $Q(\omega,\omega')$ is the double transform
of $R(t,t')$. We substitute for $\tau$ from (4-31) into (4-34) and
let

$$\omega' = \omega - \Omega. \tag{4-41}$$

Then we interchange $t$ and $t'$ and make use of the relation

$$R(t,t') = R(t',t). \tag{4-42}$$

Equation (4-34) then reduces to

$$Q(\omega,\omega') = \sum_{t'=-\infty}^{\infty} \sum_{t=-\infty}^{\infty} R(t,t')\, e^{-j(\omega t - \omega' t')}. \tag{4-43}$$


The inverse relation is:

$$R(t,t') = \left(\frac{T}{2\pi}\right)^2 \int_{-\pi/T}^{\pi/T} \int_{-\pi/T}^{\pi/T} Q(\omega,\omega')\, e^{j(\omega t - \omega' t')}\, d\omega\, d\omega'. \tag{4-44}$$

The 2D-spectrum $Q(\omega,\omega')$ is related to the 2D-spectrum
$\Gamma(\omega,\Omega)$ by the relation

$$Q(\omega,\omega') = \Gamma(\omega,\, \omega-\omega'). \tag{4-45}$$

$Q(\omega,\omega')$ is periodic and hermitian in $\omega$ and
$\omega'$. It obeys the relations

$$Q(\omega+n\omega_s,\, \omega'+m\omega_s) = Q(\omega,\omega'), \quad -\infty < n,m < \infty, \tag{4-46}$$

$$Q(-\omega,-\omega') = \overline{Q}(\omega,\omega'), \tag{4-47}$$

and

$$Q(\omega',\omega) = \overline{Q}(\omega,\omega'). \tag{4-48}$$

For a stationary process:

$$Q(\omega,\omega') = 2\pi P(\omega) \sum_{n=-\infty}^{\infty} u_0(\omega - \omega' - n\omega_s). \tag{4-49}$$

Just as for $\Gamma(\omega,\Omega)$ in (4-39), $Q(\omega,\omega')$
consists of a set of periodic line masses with density
$2\pi P(\omega)$. In the $\omega,\omega'$ plane these lines would be
diagonal lines compared to vertical lines in the $\omega,\Omega$
plane.

We have introduced in this section two 2D-spectra,
$\Gamma(\omega,\Omega)$ and $Q(\omega,\omega')$.
$\Gamma(\omega,\Omega)$ was introduced first as a more intuitive
definition of the 2D-spectrum starting from a time-varying power
spectrum. However, as we shall see in the next section, the
Covariance normal equations are easily derived by working with
$Q(\omega,\omega')$ and $R(t,t')$ directly.

In the Autocorrelation method, $P(\omega)$ was considered to be the
short-time spectrum for the particular frame of interest. Several
methods for estimating $P(\omega)$ were mentioned in Section 4.4.
However, for the purposes of linear prediction, it was found that the
estimation of a number of autocorrelation coefficients sufficed.
Similarly, in the Covariance method we shall consider
$Q(\omega,\omega')$ to be the short-time 2D-spectrum for the frame of
interest. However, as we shall see shortly, we need not estimate
$Q(\omega,\omega')$. All that is needed for the computation of the
predictor parameters is the estimation of a set of nonstationary
autocorrelation coefficients.

4.6 Generalized Analysis-by-Synthesis and the Covariance Method

In Fig. 4-1 the signal s(nT) is passed through the inverse filter
H(z) giving as output an error signal e(nT). Both s(nT) and e(nT) are
now assumed to be nonstationary. The total energy $E$ in the error
signal is given by $R_e(0,0)$, where $R_e(t,t')$ is the nonstationary
autocorrelation of the error signal e(nT). From (4-44) we conclude
that:

$$E = \left(\frac{T}{2\pi}\right)^2 \int_{-\pi/T}^{\pi/T} \int_{-\pi/T}^{\pi/T} Q_e(\omega,\omega')\, d\omega\, d\omega', \tag{4-50}$$

where $Q_e(\omega,\omega')$ is the 2D-spectrum of the error signal.
From linear system theory (Papoulis, 1965, p. 443), we can write for
Fig. 4-1:

$$Q_e(\omega,\omega') = Q(\omega,\omega')\, H(\omega)\, \overline{H}(\omega'), \tag{4-51}$$

where $Q(\omega,\omega')$ is the 2D-spectrum of the signal s(nT), and
$H(\omega)$ has the same interpretation as before. Therefore, the
total energy in the error signal is given by:

$$E = \left(\frac{T}{2\pi}\right)^2 \int_{-\pi/T}^{\pi/T} \int_{-\pi/T}^{\pi/T} Q(\omega,\omega')\, H(\omega)\, \overline{H}(\omega')\, d\omega\, d\omega'. \tag{4-52}$$

(Compare (4-52) with (4-9) for the stationary case.)

Replacing the formula for $H(\omega)$ in (4-52) we obtain:

$$E = \left(\frac{T}{2\pi}\right)^2 \int_{-\pi/T}^{\pi/T} \int_{-\pi/T}^{\pi/T} Q(\omega,\omega') \left[ 1 - \sum_{k=1}^{p} a_k e^{-jk\omega T} \right] \left[ 1 - \sum_{k=1}^{p} a_k e^{jk\omega' T} \right] d\omega\, d\omega'. \tag{4-53}$$

In order to minimize $E$ we take $\partial E / \partial a_i = 0$,
$1 \le i \le p$.

The result of the differentiation is:

$$\left(\frac{T}{2\pi}\right)^2 \int_{-\pi/T}^{\pi/T} \int_{-\pi/T}^{\pi/T} Q(\omega,\omega') \left[ e^{-ji\omega T} + e^{ji\omega' T} - \sum_{k=1}^{p} a_k e^{j(-i\omega+k\omega')T} - \sum_{k=1}^{p} a_k e^{j(-k\omega+i\omega')T} \right] d\omega\, d\omega' = 0.$$

Using (4-44) and the property that $R(t,t') = R(t',t)$ we obtain

$$\sum_{k=1}^{p} a_k R(-iT,-kT) = R(-iT,0), \quad 1 \le i \le p. \tag{4-54}$$

We shall call (4-54) the generalized normal equations.

The minimum total-squared error $E_p$ can be obtained by using
(4-42), (4-44) and (4-45) in (4-53). The answer can be shown to be
equal to:

$$E_p = R(0,0) - \sum_{k=1}^{p} a_k R(-kT,0). \tag{4-55}$$

For the special case when the signal is stationary,
$R(t,t') = R(t'-t)$, (4-54) reduces to the Autocorrelation normal
equations (4-12), and (4-55) reduces to (4-13).
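
The reduction can be written out as a short worked step. The lines
below are a sketch under the assumption that (4-12) and (4-13) have
the usual Toeplitz form with $R_k = R(kT)$ and $R_{-k} = R_k$; those
equations are not reproduced in this section, so the correspondence
shown in the comments is an assumption rather than a quotation.

```latex
\begin{align*}
% Stationary substitution R(t,t') = R(t'-t):
R(-iT,-kT) &= R\big((i-k)T\big) = R_{i-k}, &
R(-iT,0)   &= R(iT) = R_i, \\
% (4-54) then takes the assumed form of (4-12):
\sum_{k=1}^{p} a_k R_{i-k} &= R_i, & 1 &\le i \le p, \\
% and (4-55) takes the assumed form of (4-13), cf. (4-28):
E_p &= R_0 - \sum_{k=1}^{p} a_k R_k .
\end{align*}
```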

What we shall show later in this section is that the Covariance
normal equations (3-8) are the same as (4-54) with the nonstationary
autocorrelation coefficients R(iT,kT) being approximated by the
covariance coefficients $\phi_{ik}$ defined in (3-9). First we shall
interpret the above results in terms of generalized
analysis-by-synthesis.

4.61  Generalized  Analysis-by-Synthesis 

Following a procedure analogous to that in Section 4.3, we can write
from (4-52) and (4-14):

$$E = A^2 \left(\frac{T}{2\pi}\right)^2 \int_{-\pi/T}^{\pi/T} \int_{-\pi/T}^{\pi/T} \frac{Q(\omega,\omega')}{\hat{S}(\omega)\, \overline{\hat{S}}(\omega')}\, d\omega\, d\omega'. \tag{4-56}$$

We shall define the 2D-spectrum of the approximate transfer function
$\hat{S}(z)$ as

$$\hat{Q}(\omega,\omega') = \hat{S}(\omega)\, \overline{\hat{S}}(\omega'). \tag{4-57}$$

Substituting in (4-56) we have:

$$E = A^2 \left(\frac{T}{2\pi}\right)^2 \int_{-\pi/T}^{\pi/T} \int_{-\pi/T}^{\pi/T} \frac{Q(\omega,\omega')}{\hat{Q}(\omega,\omega')}\, d\omega\, d\omega', \tag{4-58}$$

which is identical in form to (4-16). Therefore, (4-16) is a special
case of (4-58) when the signal is stationary. In Section 4.3 we
showed that the minimization of (4-16) can be considered as a method
of analysis-by-synthesis. What we have in the minimization of (4-58)
is a method of generalized analysis-by-synthesis where the signal is
in general nonstationary. The properties given in Section 4.3 also
apply to generalized analysis-by-synthesis. We note that the
minimization of (4-58) results in the generalized normal equations
given in (4-54).


4.62  Reformulation  of  the  Covariance  Method 

All formulations of the Covariance method must now obey (4-54), where
the nonstationary autocorrelation coefficients $R(t,t')$ are to be
estimated in some fashion from the speech signal. The development
here will be analogous to that given in Section 4.4 for the
Autocorrelation method. We shall define two basic formulations of the
Covariance method: the direct and indirect method. In the direct
method, the coefficients $R(t,t')$ will be computed from an infinite
signal that has been windowed by a moving window. In the indirect
method, $R(t,t')$ will be estimated from a finite unwindowed portion
of the signal. (The words "direct" and "indirect" refer to whether
the 2D-spectrum is computed directly from the signal, or indirectly
through an estimated autocorrelation function.)

1.  Direct  Method 

We shall define a nonstationary (time-varying) short-time spectrum
$P(\omega,t)$ as:

$$P(\omega,t) = \left| \sum_{\tau=-\infty}^{\infty} w(\tau)\, s(\tau-t)\, e^{-j\omega\tau} \right|^2 \tag{4-60a}$$

$$\phantom{P(\omega,t)} = \left| \sum_{\tau=0}^{(N-1)T} w(\tau)\, s(\tau-t)\, e^{-j\omega\tau} \right|^2, \tag{4-60b}$$

where $s(t)$ is the original signal, and $w(\tau)$ is a window
function that is defined to be zero for $\tau < 0$ and
$\tau \ge NT$. This definition of $P(\omega,t)$ is consistent with
the definition of $P(\omega)$ in (4-18) for the stationary
(time-independent) case. $P(\omega,t)$ can be plotted as a function
of time in a manner similar to a spectrogram.

Equation (4-60a) can be expanded as:

$$P(\omega,t) = \sum_{x=-\infty}^{\infty} w(x)\, s(x-t)\, e^{-j\omega x} \sum_{y=-\infty}^{\infty} w(y)\, s(y-t)\, e^{j\omega y} = \sum_{x=-\infty}^{\infty} \sum_{y=-\infty}^{\infty} w(x)\, w(y)\, s(x-t)\, s(y-t)\, e^{-j\omega(x-y)}. \tag{4-61}$$

Setting $x - y = \tau$, (4-61) reduces to:

$$P(\omega,t) = \sum_{\tau=-\infty}^{\infty} \sum_{x=-\infty}^{\infty} w(x)\, w(x-\tau)\, s(x-t)\, s(x-\tau-t)\, e^{-j\omega\tau}. \tag{4-62}$$

By comparing (4-62) and (4-32) we conclude that:

$$R(t,t+\tau) = \sum_{x=-\infty}^{\infty} w(x)\, w(x-\tau)\, s(x-t)\, s(x-\tau-t). \tag{4-63}$$

In order to obtain $R(t,t')$ we set $\tau = t'-t$ in (4-63):

$$R(t,t') = \sum_{x=-\infty}^{\infty} w(x)\, w(x-t'+t)\, s(x-t)\, s(x-t'). \tag{4-64}$$

Since $w(x) = 0$ for $x < 0$ and $x \ge NT$, (4-64) can be written
as:

$$R(t,t') = \sum_{x=0}^{(N-1)T} w(x)\, w(x-t'+t)\, s(x-t)\, s(x-t'). \tag{4-65}$$

Setting $t = -iT$ and $t' = -kT$ in (4-65), we obtain:

$$R(-iT,-kT) = \sum_{n=0}^{N-1} w_n\, w_{n-i+k}\, s_{n+i}\, s_{n+k}. \tag{4-66}$$

Equation (4-66) shows how to compute R(-iT,-kT) for use in the normal
equations (4-54) to solve for the predictor coefficients $a_k$. The
coefficients $w_n$ represent the sampled window function.

We note from (4-65), (4-66) and (4-54) that $t$ varies between $-pT$
and $-T$. From (4-60b) we see that, corresponding to
$-pT \le t \le -T$, the time-varying spectrum $P(\omega,t)$ can be
computed $p$ consecutive times, and after each computation the window
is moved one sample interval $T$. While the Autocorrelation method
represents the properties of a single spectrum in each frame, the
Covariance method represents the properties of $p$ consecutive
spectra in each frame.

2. Indirect Method

In this method the 2D-spectrum is computed from an estimated
nonstationary autocorrelation function $R(t,t')$ that is computed
from a finite unwindowed portion of the signal. Although several
formulations could be defined, we shall give only one which is
analogous to (4-21) in the indirect Autocorrelation method. Let us
approximate the nonstationary autocorrelation R(iT,kT) by:

$$R(iT,kT) = \sum_{n=0}^{N-1} s_{n+i}\, s_{n+k}, \quad 1 \le i,k \le p. \tag{4-67}$$

Then R(-iT,-kT) is approximated by:

$$R(-iT,-kT) = \sum_{n=0}^{N-1} s_{n-i}\, s_{n-k}. \tag{4-68}$$

But the right-hand side of (4-68) is equal to the coefficients
$\phi_{ik}$ defined by (3-9). Therefore,

$$R(-iT,-kT) = \phi_{ik}, \qquad R(-iT,0) = \phi_{i0}. \tag{4-69}$$

Substituting (4-69) in (4-54) we obtain:

$$\sum_{k=1}^{p} a_k \phi_{ik} = \phi_{i0}, \quad 1 \le i \le p,$$

which is identical to the Covariance normal equations (3-8). Also,
substituting (4-69) in (4-55) results in an expression for $E_p$ that
is identical with (3-19).
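
The indirect Covariance method can be illustrated with a small
numerical sketch. The code below computes the coefficients of (4-68)
from a finite unwindowed segment, solves the normal equations with
plain Gaussian elimination, and evaluates the error from (4-55) with
(4-69). The function names, the storage convention (the first $p$
samples of the list hold $s_{-p}, \ldots, s_{-1}$) and the choice of
solver are assumptions made for illustration; they are not taken from
the report.

```python
def covariance_coeffs(s, p):
    """phi[i][k] = sum_{n=0}^{N-1} s_{n-i} s_{n-k}, for 0 <= i, k <= p.

    s[0:p] holds the p samples preceding the frame, so s_n is s[p + n].
    """
    N = len(s) - p
    return [[sum(s[p + n - i] * s[p + n - k] for n in range(N))
             for k in range(p + 1)] for i in range(p + 1)]


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def covariance_predictor(s, p):
    """Return (a_1..a_p, E_p) from the Covariance normal equations."""
    phi = covariance_coeffs(s, p)
    A = [[phi[i][k] for k in range(1, p + 1)] for i in range(1, p + 1)]
    b = [phi[i][0] for i in range(1, p + 1)]
    a = solve_linear(A, b)
    E = phi[0][0] - sum(a[k - 1] * phi[k][0] for k in range(1, p + 1))
    return a, E
```

If the segment is generated by an exact two-pole recursion, for
instance $s_n = 1.3\, s_{n-1} - 0.6\, s_{n-2}$, then
`covariance_predictor(s, 2)` should return those two coefficients and
an error close to zero, since no window distorts the samples.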

We have shown that the Covariance method can be derived from a
frequency-domain formulation where the short-time 2D-spectrum of a
nonstationary signal is to be approximated by an all-pole
2D-spectrum. Under the assumption of a stationary signal, the
generalized formulation reduces to the Autocorrelation method.

The particular formulations presented in Chapters I and III can now
be seen to be the direct Autocorrelation and indirect Covariance
methods.

CHAPTER V

THE AUTOCORRELATION METHOD AND THE NORMALIZED ERROR

In Chapter IV it was shown that the Autocorrelation and Covariance
methods of linear prediction can be considered to be methods of
spectral analysis-by-synthesis, where the short-time spectrum
$P(\omega)$ (or 2D-spectrum $Q(\omega,\omega')$) is approximated by
an all-pole spectrum $\hat{P}(\omega)$ (or 2D-spectrum
$\hat{Q}(\omega,\omega')$). We have also seen that in order to
determine the parameters $a_k$ of $\hat{P}(\omega)$ or
$\hat{Q}(\omega,\omega')$, it was sufficient to know only a limited
number of autocorrelation coefficients R(kT) or R(iT,kT); it was
never necessary to know either $P(\omega)$ or $Q(\omega,\omega')$.
However, in order to study how $\hat{P}(\omega)$ (or
$\hat{Q}(\omega,\omega')$) approximates $P(\omega)$ (or
$Q(\omega,\omega')$), one must be able to compute the signal spectrum
$P(\omega)$ (or $Q(\omega,\omega')$). This is most easily done in the
direct method (where the signal is defined for all time) by using
(4-18) in the direct Autocorrelation method and (4-60) in the direct
Covariance method. Since it is simpler to deal with one-dimensional
rather than two-dimensional spectra, we have chosen to study the
direct Autocorrelation method in detail. Moreover, in this way we
take advantage of the body of knowledge that already exists in speech
research.

In this chapter we shall examine analytically the manner in which the
all-pole spectrum $\hat{P}(\omega)$ approximates the signal spectrum
$P(\omega)$. For the reasons stated above, this will be done for the
direct Autocorrelation method only. We believe that much insight into
linear prediction in general can be gained by analyzing this one
method in detail.


First we examine the properties of the approximate spectrum
$\hat{P}(\omega)$ and the transfer function $\hat{S}(z)$ when
compared to the signal spectrum $P(\omega)$ and transfer function
$S(z)$. Of particular interest is the analysis as $p \to \infty$,
when $\hat{s}(nT)$ becomes the minimum-phase sequence corresponding
to s(nT). Different methods for computing the minimum-phase sequence
for an arbitrary sequence are described. Next comes the analysis of
the normalized error and its behavior as a function of different
spectral shapes.

The normalized error is related to the zeroth quefrency of the
cepstrum and is interpreted in terms of the ratio of the geometric
mean to the arithmetic mean of the spectrum. Properties of the zeroth
quefrency follow from this analysis. Then, the usefulness of the
normalized error as a voicing detector is discussed. Of importance
are the properties of the first autocorrelation coefficient $R_1$.
The chapter ends in a brief discussion on the role of the normalized
error in determining the optimum number of predictor coefficients in
estimating the spectral envelope.


5.1 Properties of the Approximate Spectrum $\hat{P}(\omega)$

In Section 3.5 we derived a relation between the autocorrelation
function $R_k$ of the windowed speech signal and the autocorrelation
function $\hat{R}_k$ of the impulse response of the transfer function
$\hat{S}(z)$ defined in (2-2). This relation is given by (3-35) and
is presented here with a change of subscripts:

$$\hat{R}_k = R_k, \quad 0 \le k \le p. \tag{5-1}$$

We know that the autocorrelation function has a one-to-one
relationship with the power spectrum via the Fourier transform. Thus,
$R_k$ and $\hat{R}_k$ are the inverse Fourier transforms of
$P(\omega)$ and $\hat{P}(\omega)$, respectively (see (4-11a)). From
(5-1) we see that as the number of predictor coefficients (or poles)
$p$ increases, $R_k$ and $\hat{R}_k$ will be equal over a larger
range, resulting in a better fit of $\hat{P}(\omega)$ to
$P(\omega)$. In the limit, as $p \to \infty$, $\hat{R}_k$ becomes
identical to $R_k$ for all $k$, and hence the power spectra
$\hat{P}(\omega)$ and $P(\omega)$ become identical:

$$\hat{P}(\omega) = P(\omega), \quad \text{as } p \to \infty. \tag{5-2}$$

One may not be interested in getting an exact replica of
$P(\omega)$, but (5-1) and (5-2) give one a better understanding of
the approximation process.
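
The matching property (5-1) can be checked numerically. The sketch
below is an illustration under stated assumptions: the predictor is
obtained with the Levinson-Durbin recursion, the gain is taken as
$A = \sqrt{E_p}$, the model impulse response is truncated to a finite
length, and the five-sample frame is an arbitrary made-up example.
None of the function names come from the report.

```python
def autocorr(x, max_lag):
    """R_k = sum_n x[n] x[n+k], k = 0..max_lag (x is zero outside its list)."""
    return [sum(x[n] * x[n + k] for n in range(len(x) - k))
            for k in range(max_lag + 1)]


def levinson(R, p):
    """Levinson-Durbin recursion; returns (a_1..a_p, E_p)."""
    a, E = [], R[0]
    for i in range(1, p + 1):
        k = (R[i] - sum(a[j] * R[i - 1 - j] for j in range(i - 1))) / E
        a = [a[j] - k * a[i - 2 - j] for j in range(i - 1)] + [k]
        E *= 1.0 - k * k
    return a, E


def all_pole_impulse_response(a, gain, length):
    """First `length` samples of the response of gain / (1 - sum a_k z^-k)."""
    h = []
    for n in range(length):
        drive = gain if n == 0 else 0.0
        past = sum(a[k] * h[n - 1 - k] for k in range(min(len(a), n)))
        h.append(drive + past)
    return h


s = [1.0, 0.8, -0.3, 0.5, 0.2]   # hypothetical windowed frame
p = 2
R = autocorr(s, 3)               # lags 0..3 of the signal
a, E_p = levinson(R, p)
h = all_pole_impulse_response(a, E_p ** 0.5, 200)
R_hat = autocorr(h, 3)           # lags 0..3 of the model
```

Under these assumptions `R_hat[k]` agrees with `R[k]` for
$0 \le k \le p$, while the lag $k = p+1$ is free to differ, which is
what (5-1) leads one to expect for a finite order.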


From (4-13) we have the minimum total-squared error $E_p = A^2$.
Substituting for $E_p$ in (4-16) we have:

$$\frac{T}{2\pi} \int_{-\pi/T}^{\pi/T} \frac{P(\omega)}{\hat{P}(\omega)}\, d\omega = 1. \tag{5-3}$$

Equation (5-3) is independent of $p$, the order of the linear
predictor. In particular, we know from (5-2) that as
$p \to \infty$, $\hat{P}(\omega) = P(\omega)$. In that case, (5-3)
becomes an identity. In Appendix B we show that (5-3) is a special
case of a more general result, namely that the polynomials
$H_0(z), H_1(z), \ldots, H_n(z), \ldots$ form a complete set of
orthogonal polynomials with weight $P(\omega)$, where
$H_n(z) = H(z)$ for $p = n$, and $H(z)$ is the inverse filter defined
in (2-3).

5.2 Properties of the Transfer Function $\hat{S}(z)$

From (4-6) we have $P(\omega) = |S(\omega)|^2$, and
$\hat{P}(\omega) = |\hat{S}(\omega)|^2$, where $S(z)$ is the
z-transform of the speech signal s(nT) and $\hat{S}(z)$ is the
corresponding transfer function of the speech production model
according to linear prediction. We wish to explore how $\hat{S}(z)$
might relate to $S(z)$. We have the definitions:

$$S(z) = \sum_{n=0}^{N} s_n z^{-n}, \tag{5-4}$$

and

$$\hat{S}_p(z) = \frac{A_p}{1 - \sum_{k=1}^{p} a_k z^{-k}}, \tag{5-5}$$

where (5-5) is identical to (2-2) except that $\hat{S}(z)$ and the
gain factor $A$ have been subscripted to indicate the order of the
predictor. The subscripts will be used only when necessary for
disambiguation. Note that the upper limit on $n$ in (5-4) is now $N$
instead of $(N-1)$; this was done here for convenience.


In light of (4-6) and (5-2), it is natural to ask how the transfer
functions $S(z)$ and $\hat{S}(z)$ are related as $p \to \infty$.
Since $|\hat{S}_\infty(\omega)|^2 = |S(\omega)|^2$, it might seem
that $\hat{S}_\infty(z)$ will be equal to $S(z)$. However, this is
not true in general. As $p \to \infty$, $\hat{S}_\infty(z) = S(z)$ if
and only if the windowed signal is minimum-phase, i.e. $S(z)$ has no
zeros or poles outside the unit circle. We know in general that the
speech signal is nonminimum-phase; it sometimes has zeros outside the
unit circle due primarily to the glottal waveform (Flanagan, 1965,
p. 140). We also know that $\hat{S}(z)$, in the direct
Autocorrelation method, is always minimum-phase: all the poles are
inside the unit circle and there are no zeros. Furthermore, there is
a unique minimum-phase sequence whose spectrum is identical to
$P(\omega)$. Since $\hat{S}_\infty(z)$ is minimum-phase and its
spectrum $\hat{P}_\infty(\omega)$ is identical to $P(\omega)$, we
conclude that $\hat{S}_\infty(z)$ is the transfer function of the
minimum-phase sequence corresponding to the signal s(nT).
$\hat{S}_\infty(z)$ can be written as:

$$\hat{S}_\infty(z) = \frac{A_\infty}{1 - \sum_{k=1}^{\infty} a_k z^{-k}} = \sum_{n=0}^{\infty} \hat{s}_n z^{-n} \tag{5-6a}$$

$$\phantom{\hat{S}_\infty(z)} = \sum_{n=0}^{M} b_n z^{-n} = B(z), \tag{5-6b}$$

where $b(nT) = \hat{s}(nT)$ as $p \to \infty$, and it is equal to the
minimum-phase sequence corresponding to the signal s(nT), $M$ is an
integer to be determined, and $B(z)$ is the z-transform of b(nT) and
is equal to $\hat{S}_\infty(z)$. Below we shall describe how to
compute the sequence b(nT). Of particular interest in Section 5.3
will be the computation of $A_\infty$, which from (5-6) is equal to

$$A_\infty = b_0. \tag{5-7}$$

This is shown by long division of $A_\infty$ by
$1 - \sum_{k=1}^{\infty} a_k z^{-k}$ and equating terms in (5-6a) and
(5-6b).

The determination of the minimum-phase sequence b(nT) is equivalent
to the classic problem of factorization of the spectrum $P(\omega)$
into

$$P(\omega) = B(\omega)\, \overline{B}(\omega), \tag{5-8}$$

where $B(\omega)$ is to be minimum-phase. Kolmogorov (1939) gave the
general solution of this factorization problem. Fejér (1915) gave
another solution for the special case of rational spectra.

We shall give algorithms based on both methods. Our major source for
this analysis is the 1954 Ph.D. thesis of Robinson, which was
reprinted in Geophysics (Robinson, 1967a). The Fejér method can be
found also in Grenander and Szego (1958, pp. 20-26). A third method
based on linear prediction will then be described.

A - Fejér Method

The Fejér method assumes only that the expression for $P(\omega)$ is
known. However, in our problem we also know $S(z)$. The method
described below is an adaptation of Fejér's with $S(z)$ assumed to be
known.

Substituting $z$ for $e^{j\omega T}$ in $P(\omega)$, we obtain

$$P(z) = S(z)\, S(z^{-1}), \tag{5-9}$$

which from (5-8) must also equal:

$$P(z) = B(z)\, B(z^{-1}). \tag{5-10}$$

Without loss of generality we shall assume that the samples $s_0$ and
$s_N$ of the signal are non-zero. (This can always be ensured by
defining the signal properly.) The polynomial $S(z)$ in (5-4) has $N$
zeros, hence it can be written as:

$$S(z) = s_0 \prod_{k=1}^{u} (1 - \alpha_k z^{-1}) \prod_{k=1}^{v} (1 - \beta_k z^{-1}), \tag{5-11}$$

where $\alpha_k$ are the roots inside the unit circle, $\beta_k$ are
the roots outside the unit circle, and

$$u + v = N. \tag{5-12}$$

(We shall ignore cases with roots exactly on the unit circle, since
they would rarely occur for an actual signal.) It is clear from
(5-11) that $S(z^{-1})$ will have $u$ roots outside the unit circle
and $v$ roots inside the unit circle. Therefore, $P(z)$ in (5-9) has
a total of $2N$ roots, $N$ roots inside the unit circle, and their
reciprocals outside the unit circle. We conclude from (5-10) that
$B(z)$ must have $N$ roots. Therefore, $M = N$ in (5-6b).

We  wish  to  have  all  the  roots  of  B(z)  be  inside  the  unit 
circle,  hence 


N 

B(z)  =  V  bn 
n=0  n 


-n 


(5-13) 


=  boTT  (1“ak  2_1)  T T  (1“ek1  z_i)'  (5-14) 

k=l  k=l 

The roots of $B(z)$ can be computed from the roots of $S(z)$. There still remains the computation of $b_0$. Since the power spectra of $B(z)$ and $S(z)$ are identical, they must also have identical autocorrelation functions. In particular $R_N$, the $N$th autocorrelation coefficient, must be the same for both. From (1-6) (with $N-1$ replaced by $N$):

$$R_N = s_0 s_N = b_0 b_N. \tag{5-15}$$

By equating the coefficients of $z^{-N}$ in (5-13) and (5-14), we have

$$b_N = (-1)^N\, b_0 \prod_{k=1}^{u} \alpha_k \prod_{k=1}^{v} \beta_k^{-1}. \tag{5-16}$$

Substituting for $b_N$ in (5-15) we obtain:

$$b_0^2 = \frac{s_0\, s_N}{(-1)^N \prod_{k=1}^{u} \alpha_k \prod_{k=1}^{v} \beta_k^{-1}}. \tag{5-17}$$


From  (5-11),  (5-14)  and  (5-17),  the  specification  of  B(z)  is 




Report  No.  2304 


Bolt  Beranek  and  Newman  Inc. 


complete. From (5-6), $B(z) = \hat{S}_\infty(z)$, and we have now determined the transfer function $\hat{S}_p(z)$ as $p \to \infty$. Note in (5-13) that the sequence $b(nT)$ is of equal length to $s(nT)$, and $b(nT) = 0$ for $n<0$ and $n>N$.
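As a small numerical illustration of (5-11) to (5-17), the following Python sketch builds a short real sequence from roots that are assumed to be known, reflects the outside roots into the unit circle, and computes $b_0$ from (5-17). The function names and the example roots are hypothetical and chosen only for illustration; the sketch also assumes real or conjugate-paired roots so that all coefficients are real.

```python
def poly_from_roots(roots, gain):
    """Coefficients of gain * prod(1 - r z^-1), in increasing powers of z^-1."""
    coeffs = [complex(gain)]
    for r in roots:
        nxt = coeffs + [0j]
        for i, c in enumerate(coeffs):
            nxt[i + 1] -= r * c
        coeffs = nxt
    return [c.real for c in coeffs]  # real for real or conjugate-paired roots


def autocorr(x):
    """Autocorrelation R_k = sum_n x_n x_{n+k} for k = 0..len(x)-1."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(n)]


def minimum_phase(s0, inside, outside):
    """Build s(nT) from its roots, then b(nT) from (5-14) and (5-17)."""
    s = poly_from_roots(list(inside) + list(outside), s0)
    n = len(s) - 1
    prod = complex((-1) ** n)
    for alpha in inside:
        prod *= alpha
    for beta in outside:
        prod /= beta
    b0 = ((s[0] * s[n]) / prod).real ** 0.5  # equation (5-17)
    b = poly_from_roots(list(inside) + [1 / beta for beta in outside], b0)
    return s, b


# Hypothetical example: one root inside and two roots outside the unit circle.
s, b = minimum_phase(1.0, inside=[0.5], outside=[2.0, -3.0])
print("s(nT):", s)
print("b(nT):", b)
print("autocorrelation of s:", autocorr(s))
print("autocorrelation of b:", autocorr(b))
```

The two printed autocorrelation sequences should agree, which is the property that (5-15) relies on.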


Computational Considerations

The main problem in finding $\hat{S}_\infty(z)$ is computing the $N$ roots of $S(z)$. For 25 msec of 10 kHz sampled speech, $N=250$. Finding the roots of a 250- or even a 100-degree polynomial is a major undertaking. To say the least, the method we have just outlined is highly impractical. The main reason for the above discussion was to show that although $\hat{S}_\infty(z)$ has an infinity of poles, it can be written as a polynomial with a finite number of zeros. Also, the minimum-phase sequence $b(nT)$ has the same length as the original sequence $s(nT)$.


B. Cepstral Method (Kolmogorov Method)

Although Kolmogorov did not use the word "cepstrum" to refer to the Fourier transform of the logarithm of the spectrum, the operation itself was used. A more recent analysis of this subject can be found in Oppenheim and Schafer (1968). We shall make use of the latter reference below.

The problem again is to compute the minimum-phase sequence $b(nT)$ corresponding to the speech sequence $s(nT)$. The z-transform of $b(nT)$ is $\hat{S}_\infty(z)$. Below we shall drop the subscript $\infty$ and simply use $\hat{S}(z)$ as the minimum-phase transfer function corresponding to $S(z)$.


Let the cepstrum $c(nT)$ of $S(z)$ be defined as:

$$c_n = \frac{T}{2\pi}\int_{-\pi/T}^{\pi/T} \log|S(\omega)|^2\, e^{jn\omega T}\, d\omega = \frac{T}{2\pi}\int_{-\pi/T}^{\pi/T} \log P(\omega)\, e^{jn\omega T}\, d\omega. \tag{5-18}$$

The cepstrum $\hat{c}(nT)$ of $\hat{S}(z)$ can be similarly defined. Since $|\hat{S}(\omega)|^2 = |S(\omega)|^2$ (the spectra are identical), we conclude that $\hat{c}_n = c_n$. We note from the properties of the spectrum and (5-18) that $c_n$ is real and even.


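Let the complex cepstrum $c'(nT)$ of $\hat{S}(z)$ be defined as:

$$c'_n = \frac{T}{2\pi}\int_{-\pi/T}^{\pi/T} \log \hat{S}(\omega)\, e^{jn\omega T}\, d\omega. \tag{5-19}$$

$\hat{S}(\omega)$ can be written as:

$$\hat{S}(\omega) = |\hat{S}(\omega)|\, e^{j\theta(\omega)} = |S(\omega)|\, e^{j\theta(\omega)}. \tag{5-20}$$

Therefore,

$$\log \hat{S}(\omega) = \log|S(\omega)| + j\theta(\omega). \tag{5-21}$$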
$\log|S(\omega)|$ is an even function of frequency and $\theta(\omega)$ is a continuous odd function of frequency. Therefore, $c'_n$ is a real function. Furthermore, since $\hat{S}(z)$ is minimum-phase we have (Oppenheim and Schafer, 1968):

$$c'_n = 0, \quad n < 0. \tag{5-22}$$

From (5-21), (5-19) and (5-18) we conclude that the even part of $c'_n$ should be equal to $c_n/2$:

$$\mathrm{Even}[c'_n] = \tfrac{1}{2}\left[c'_n + c'_{-n}\right] = \tfrac{1}{2}\, c_n,$$

$$\text{or}\quad c'_n + c'_{-n} = c_n. \tag{5-23}$$


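Combining (5-22) and (5-23), the complex cepstrum follows from the cepstrum:

$$c'_n = \begin{cases} c_0/2, & n = 0, \\ c_n, & n > 0, \\ 0, & n < 0. \end{cases} \tag{5-24}$$

The transfer function and the minimum-phase sequence are then obtained from

$$\log \hat{S}(\omega) = \sum_{n=0}^{\infty} c'_n\, e^{-jn\omega T}, \tag{5-25}$$

$$b_n = \frac{T}{2\pi}\int_{-\pi/T}^{\pi/T} \hat{S}(\omega)\, e^{jn\omega T}\, d\omega. \tag{5-26}$$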
Equations (5-18), (5-24), (5-25) and (5-26) specify the sequence of computations needed to find the minimum-phase sequence $b(nT)$. Of particular interest is the value of $b_0$, which from (5-26), (5-25) and (5-24) is equal to:

$$b_0 = e^{c'_0} = e^{c_0/2}, \quad\text{or}\quad b_0^2 = e^{c_0}. \tag{5-27}$$

This result will be important in the next section.


Computational Considerations

The power spectrum $P(\omega) = |S(\omega)|^2$ is a continuous function of frequency and so is $\log P(\omega)$. The cepstrum $c(nT)$, which is the inverse transform of $\log P(\omega)$, is potentially infinite in extent. In practice, the cepstrum becomes negligibly small at high cepstral values (or quefrencies). Therefore, $P(\omega)$ must be computed with enough resolution that no cepstral aliasing occurs. This criterion is realized by trial and error.

We shall give the whole algorithm in machine-implementable form. We assume that we are given the sequence $s(nT)$.


(1) Take the FFT of $s(nT)$ with enough zeros appended to give sufficient spectral resolution, giving $S(\omega)$ at a finite number of equispaced frequencies. Let this number be $M$.

(2) Compute $M$ values of $C(\omega) = \log|S(\omega)|^2$.

(3) Take the inverse $M$-point FFT of $C(\omega)$ to obtain $M$ points of $c(nT)$.

(4) Compute $c'_n$ from $c_n$ as follows:

$$c'_n = \begin{cases} \tfrac{1}{2}\, c_0, & n = 0, \\ c_n, & 0 < n < M/2, \\ \tfrac{1}{2}\, c_{M/2}, & n = M/2, \\ 0, & M/2 < n \le M-1. \end{cases} \tag{5-28}$$

Note the differences between (5-28) and (5-24). The changes are necessary in order to deal with a finite instead of the theoretically infinite sequence.

(5) Take the FFT of $c'_n$ to obtain $\log \hat{S}(\omega) = \log|S(\omega)| + j\theta(\omega)$ at $M$ frequency values.

(6) Compute $\hat{S}(\omega) = |S(\omega)|\cos[\theta(\omega)] + j\,|S(\omega)|\sin[\theta(\omega)]$.

(7) Take the inverse FFT of $\hat{S}(\omega)$ to obtain $b(nT)$.
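Steps (1) to (7) can be sketched in Python as follows. This is a minimal illustration, not the implementation used for the report: a direct $O(M^2)$ DFT stands in for the FFT so that the block needs only the standard library, $M$ is assumed to be even, the spectrum is assumed to have no exact zeros, and the test signal is a hypothetical four-sample sequence rather than speech.

```python
import cmath
import math


def dft(x, inverse=False):
    """Direct O(M^2) discrete Fourier transform (stand-in for an FFT)."""
    m = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(m):
        acc = 0j
        for n in range(m):
            acc += x[n] * cmath.exp(sign * 2j * math.pi * k * n / m)
        out.append(acc / m if inverse else acc)
    return out


def minimum_phase_cepstral(s, m):
    """Minimum-phase sequence with the same power spectrum as s (m even)."""
    padded = list(s) + [0.0] * (m - len(s))                 # step (1)
    spectrum = dft(padded)
    log_power = [math.log(abs(v) ** 2) for v in spectrum]   # step (2)
    c = [v.real for v in dft(log_power, inverse=True)]      # step (3)
    half = m // 2
    c_prime = [0.0] * m                                     # step (4), (5-28)
    c_prime[0] = 0.5 * c[0]
    for n in range(1, half):
        c_prime[n] = c[n]
    c_prime[half] = 0.5 * c[half]
    log_s_hat = dft(c_prime)                                # step (5)
    s_hat = [cmath.exp(v) for v in log_s_hat]               # step (6)
    b = dft(s_hat, inverse=True)                            # step (7)
    return [v.real for v in b]


# Hypothetical test signal with roots at 0.5, 2 and -3.
s = [1.0, 0.5, -6.5, 3.0]
b = minimum_phase_cepstral(s, 64)
print([round(v, 6) for v in b[:6]])
```

Step (6) is written as a complex exponential, which equals $|S(\omega)|\cos\theta(\omega) + j|S(\omega)|\sin\theta(\omega)$. For this short signal the values of `b` beyond the fourth sample should come out negligibly small, in line with the remark on $n>N$ that follows.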


$M$ must be greater than $N$, the number of samples in the signal. A value of $M = 2N$ gives good results for a windowed signal with large $N$ (about 250). $b(nT)$ should come out to be zero for $n>N$, but in practice it will have small values in that region.

Another occasional source of problems in this method arises when one of the values of $P(\omega)$ approaches zero, so that the logarithm approaches $-\infty$. For a speech signal this problem is most likely to occur when the d.c. value is zero. This problem will be discussed further in Section 5.4 in connection with the computation of $c_0$.

C. Linear Prediction Method

From (5-6), the sequence $b(nT)$ can be obtained by long division of $A_\infty$ by $1-\sum_{k=1}^{\infty} a_k z^{-k}$. However, one must first know $A_\infty$ as well as $a_k$ for all $k$. This, of course, is not possible, but one can make an approximation to $\hat{S}_\infty(z)$ by considering $\hat{S}_p(z)$ in (5-5) for a large value of $p$. The computation of the predictor parameters $a_k$ is then possible by the Fast Autocorrelation method (see Appendix B), and $A_p$ is computed from (3-36). Dividing $A_p$ by $1-\sum_{k=1}^{p} a_k z^{-k}$ gives a polynomial whose coefficients approximate the minimum-phase sequence $b(nT)$.
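The linear prediction route can be sketched in the same way. The block below is illustrative only: a standard Levinson-Durbin recursion is used as a stand-in for the Fast Autocorrelation method of Appendix B, the gain is taken as $A_p = \sqrt{E_p}$ (an assumption consistent with the relation $\log A_p^2 = \log E_p$ used in Section 5.3), and the signal and the order $p = 40$ are hypothetical.

```python
def autocorrelation(s, p):
    """R_k = sum_n s_n s_{n+k} for k = 0..p (zero beyond the signal length)."""
    n = len(s)
    return [sum(s[i] * s[i + k] for i in range(n - k)) if k < n else 0.0
            for k in range(p + 1)]


def levinson(r, p):
    """Levinson-Durbin recursion: predictor a_1..a_p and minimum error E_p."""
    a = [0.0] * (p + 1)
    err = r[0]
    for i in range(1, p + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= 1.0 - k * k
    return a[1:], err


def long_division(gain, a, length):
    """First `length` coefficients of gain / (1 - sum_k a_k z^-k)."""
    b = []
    for n in range(length):
        v = gain if n == 0 else 0.0
        for k in range(1, min(n, len(a)) + 1):
            v += a[k - 1] * b[n - k]
        b.append(v)
    return b


# Hypothetical test signal with roots at 0.5, 2 and -3.
s = [1.0, 0.5, -6.5, 3.0]
p = 40
a, err = levinson(autocorrelation(s, p), p)
b = long_division(err ** 0.5, a, 8)
print([round(v, 6) for v in b])
```

For this signal the first four coefficients of `b` approximate the minimum-phase sequence and the remaining ones are close to zero; the approximation improves as `p` grows.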


Figure 5-1a shows a windowed signal $s(nT)$ of duration 25.6 msec (10 kHz sampling rate). The minimum-phase sequence $b(nT)$ corresponding to $s(nT)$ was computed by two methods: the cepstral method and the linear prediction method. Figure 5-1b shows the approximation to $b(nT)$ as computed by the cepstral method using 512-point FFT's (256 zeros were appended to $s(nT)$). Figure 5-1c shows the approximation to $b(nT)$ as computed by the linear prediction method with $p = 250$. All the figures are normalized to the same maximum amplitude. For a given accuracy, the cepstral method is more efficient than the linear prediction method.
Fig. 5-1. Computation of the minimum-phase sequence $b(nT)$ corresponding to the windowed signal $s(nT)$.
(a) $s(nT)$: 25.6 msec of 10 kHz sampled speech.
(b) Cepstral method using 512-point FFT.
(c) Linear prediction method using $p = 250$.
5.3 Analysis of the Normalized Error

We mentioned in Section 5.1 that as $p \to \infty$, $\hat{P}(\omega)$ becomes identical to $P(\omega)$. In this section we shall examine this process of approximation by analyzing the behavior of the normalized minimum total-squared error $V_p$, or simply the normalized error.

The normalized error was defined in Section 3.3 as the minimum total-squared error divided by the energy of the signal $s(nT)$:

$$V_p = \frac{E_p}{R_0} = \frac{A_p^2}{R_0}, \tag{5-29}$$

or

$$V_p = 1 - \sum_{k=1}^{p} a_k r_k, \tag{5-30}$$

where

$$r_k = \frac{R_k}{R_0} \tag{5-31}$$

are the normalized autocorrelation coefficients, which have the property that $|r_k| \le 1$ for all $k$. The sum $\sum_{k=1}^{p} a_k r_k$ on the right-hand side of (5-30) cannot be negative, since the choice $a_k=0$, $1 \le k \le p$, would then reduce $V_p$. This is not possible because $V_p$ is already a minimum. Therefore, $\sum_{k=1}^{p} a_k r_k \ge 0$ must always hold and $V_p \le 1$.

By an argument similar to the above one can show that $V_{p+1} \le V_p$, and hence that $V_p$ is a monotonically decreasing function of $p$. As $p \to \infty$, $V_p$ approaches a minimum value $V_\infty = V_{\min} \ge 0$. The latter condition is true because $V_p$ is a normalized squared error and therefore $V_p \ge 0$. Hence,

$$0 \le V_p \le 1. \tag{5-32}$$

This result will be shown in a different way later.
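The bounds and the monotonic behavior of $V_p$ can be checked numerically. The following sketch computes $V_p$ for increasing $p$ with a standard Levinson-Durbin recursion, used here as an assumed stand-in for the recursion of Appendix B; the function name and the short test signal are hypothetical.

```python
def normalized_errors(s, pmax):
    """V_p for p = 0..pmax from the autocorrelation of s (V_0 = 1)."""
    n = len(s)
    r = [sum(s[i] * s[i + k] for i in range(n - k)) if k < n else 0.0
         for k in range(pmax + 1)]
    a = [0.0] * (pmax + 1)
    err = r[0]
    v = [1.0]
    for p in range(1, pmax + 1):
        k = (r[p] - sum(a[j] * r[p - j] for j in range(1, p))) / err
        prev = a[:]
        a[p] = k
        for j in range(1, p):
            a[j] = prev[j] - k * prev[p - j]
        err *= 1.0 - k * k          # E_p = (1 - k_p^2) E_{p-1}
        v.append(err / r[0])        # V_p = E_p / R_0
    return v


# Hypothetical test signal; not speech data.
s = [1.0, 0.5, -6.5, 3.0]
v = normalized_errors(s, 30)
for p in (0, 1, 2, 5, 10, 30):
    print(p, round(v[p], 6))
```

Each step multiplies the error by $1 - k_p^2$, which lies between 0 and 1, so the sequence can never increase.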


Figure 5-2 shows normalized error curves as a function of $p$ for the unvoiced fricative [s] in the word "list" and the vowel [æ] in the word "potassium". The speech signal was lowpassed at 4.5 kHz and sampled at 10 kHz. Each of the two error curves decreases monotonically towards its own asymptote $V_{\min}$ as $p \to \infty$.

The largest single drop in both error curves occurs for $p=1$. Thus $V_1$ is indicative of the eventual levels of the error curves. It is instructive to examine the behavior of $V_1$ for different sounds. From the flow chart in Appendix B we note that for $p=1$, $a_1 = R_1/R_0 = r_1$. Substituting in (5-30):

$$V_1 = 1 - r_1^2. \tag{5-33}$$

In terms of the spectrum,

$$r_1 = \frac{R_1}{R_0} = \frac{\int_{-\pi/T}^{\pi/T} P(\omega)\cos(\omega T)\, d\omega}{\int_{-\pi/T}^{\pi/T} P(\omega)\, d\omega}. \tag{5-34}$$


$R_0$ is the integral of the spectrum, which is equal to the total energy in the signal. $R_1$ is the integral of the cosine-weighted spectrum. The cosine weighting is shown in Fig. 5-3. Low frequencies are weighted positively, high frequencies are weighted negatively, while mid frequencies do not contribute much to the value of $R_1$. So, what is important in determining the value of $r_1$ is the energy balance between low and high frequencies. For example, sonorants usually have most of their energy concentrated at low frequencies, resulting in a value of $R_1$ very close to $R_0$. Typically $r_1>.85$ for sonorants, and from (5-33) $V_1<.25$. On the other hand, unvoiced frication has its energy either distributed over the whole frequency range or more concentrated at high frequencies. Typical values of $r_1$ are such that $-.5<r_1<.5$, with negative values being more likely for strident fricatives. This results in $V_1>.75$. Note that it is the absolute value of $r_1$ that is important in determining the value of $V_1$. Figure 5-4 shows a plot of $V_1$ as a function of $r_1$. If most of the energy in the spectrum is concentrated at high frequencies then $r_1$ becomes close to $-1$ and $V_1$ becomes very small. In general, any particular spectrum and its mirror image (low and high frequencies interchanged) have identical values for $V_1$.
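The mirror-image property can be illustrated with a few lines of Python, assuming the first-order relation $V_1 = 1 - r_1^2$. Multiplying a real sequence by $(-1)^n$ interchanges low and high frequencies; the sample values below are hypothetical and serve only as a slowly varying (low-frequency) sequence.

```python
def first_order_error(s):
    """Return (r_1, V_1) with r_1 = R_1 / R_0 and V_1 = 1 - r_1**2."""
    r0 = sum(x * x for x in s)
    r1 = sum(s[i] * s[i + 1] for i in range(len(s) - 1))
    rho = r1 / r0
    return rho, 1.0 - rho * rho


# Slowly varying sequence: energy concentrated at low frequencies.
low = [1.0, 0.9, 0.7, 0.4, 0.1, -0.2, -0.4, -0.5]
# Multiplication by (-1)^n mirrors the spectrum about a quarter of the sampling rate.
mirrored = [x if i % 2 == 0 else -x for i, x in enumerate(low)]

r_low, v_low = first_order_error(low)
r_mir, v_mir = first_order_error(mirrored)
print(r_low, v_low)
print(r_mir, v_mir)
```

The sign of $r_1$ flips under mirroring while $V_1$ is unchanged, since only $r_1^2$ enters.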

Above we tried to make three points: 1) One can get insight into the general level of the normalized error curve by examining the behavior of $V_1$. 2) The value of $V_1$ depends on the absolute value of the normalized first autocorrelation coefficient $r_1=R_1/R_0$. 3) The value of $r_1$ depends on the relative energy distribution in the spectrum. In order to get more insight into the behavior of the normalized error curve, we must examine $V_p$ as $p$ varies. This requires that we examine the autocorrelation function, since $V_p$ is a function of only the autocorrelation coefficients $R_k$, $1 \le k \le p$. This can be seen from (5-30) and the fact that the predictor coefficients are computed from the autocorrelation coefficients by solving (3-17). The expression for $V_p$ in terms of the autocorrelation coefficients becomes very complicated as $p$ increases, and very little insight can be gained in that direction. On the other hand, we know that there is a one-to-one relationship between the autocorrelation function and the spectrum. Therefore, an alternate course is to examine $V_p$ as a function of the spectrum. This relation could be obtained from the results of Section 5.2 on minimum-phase sequences, but we shall give a more direct derivation below. The expression for $V_p$ will be in terms of the zeroth coefficient (quefrency) of the cepstrum corresponding to $\hat{P}(\omega)$. An expression for $V_{\min}$ then follows directly.

Substituting $\hat{P}(\omega)$ for $P(\omega)$ and $\hat{c}_n$ for $c_n$ in (5-18), and letting $n=0$, we obtain:

$$\hat{c}_0 = \frac{T}{2\pi}\int_{-\pi/T}^{\pi/T} \log \hat{P}(\omega)\, d\omega. \tag{5-36}$$

$\hat{c}_0$ is just the integral of the logarithm of the approximate spectrum $\hat{P}(\omega)$. $\hat{c}_0$ is a function of $p$ since $\hat{P}(\omega)$ is a function of $p$.
The approximate spectrum $\hat{P}(\omega)$ in (4-6a) can be rewritten as:

$$\hat{P}(\omega) = \frac{A_p^2}{\prod_{k=1}^{p} \left|1 - z_k e^{-j\omega T}\right|^2} = \frac{A_p^2}{\prod_{k=1}^{p} \left(1+|z_k|^2 - 2\left[z_{kr}\cos(\omega T) + z_{ki}\sin(\omega T)\right]\right)}, \tag{5-37}$$

where $z_k = z_{kr} + j z_{ki}$, $1 \le k \le p$, are the poles of the transfer function $\hat{S}(z)$, and $z_{kr}$ and $z_{ki}$ are the real and imaginary parts of the poles, respectively. Since the logarithm of a product is equal to the sum of the logarithms of its elements, (5-37) can be substituted in (5-36) to obtain (after interchanging integration and summation):

$$\hat{c}_0 = \log A_p^2 - \sum_{k=1}^{p} \frac{T}{2\pi}\int_{-\pi/T}^{\pi/T} \log\left(1+|z_k|^2 - 2\left[z_{kr}\cos(\omega T) + z_{ki}\sin(\omega T)\right]\right) d\omega. \tag{5-38}$$

Since all the poles of $\hat{S}(z)$ are guaranteed to be inside the unit circle, we have $|z_k|<1$, $1 \le k \le p$. For this special case, the integral in (5-38) is equal to zero (Gradshteyn and Ryzhik, 1963, p. 542). (For $|z_k|>1$, the integral multiplied by $\frac{T}{2\pi}$ is equal to $\log|z_k|^2$.) Therefore:

$$\hat{c}_0 = \log A_p^2 = \log E_p. \tag{5-39}$$
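The value of the integral in (5-38) is easy to check numerically. The sketch below averages $\log|1 - z e^{-j\theta}|^2$ over equispaced angles $\theta = \omega T$, which approximates the integral multiplied by $T/2\pi$; the function name, the number of sample points and the two pole positions are hypothetical choices for illustration.

```python
import cmath
import math


def mean_log_factor(z, m=256):
    """Average of log|1 - z e^{-j theta}|^2 over m equispaced angles."""
    total = 0.0
    for i in range(m):
        theta = 2.0 * math.pi * i / m
        total += math.log(abs(1.0 - z * cmath.exp(-1j * theta)) ** 2)
    return total / m


inside = mean_log_factor(0.6 + 0.3j)    # |z| < 1: expected to vanish
outside = mean_log_factor(2.0 + 0.0j)   # |z| > 1: expected log|z|^2
print(inside, outside, math.log(4.0))
```

The first value should be zero to within rounding error and the second should match $\log|z|^2 = \log 4$, in agreement with the two cases quoted from the integral tables.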
The zeroth coefficient of the approximate cepstrum is equal to the logarithm of the minimum total-squared error. Substituting (5-39) in (5-29) we obtain the desired result:

$$V_p = \frac{e^{\hat{c}_0}}{R_0}. \tag{5-40}$$

(Note that $\hat{R}_0 = R_0$ for all $p$.)

From (5-2) we know that as $p \to \infty$, $\hat{P}(\omega)$ becomes equal to $P(\omega)$. Substituting $P(\omega)$ for $\hat{P}(\omega)$ in (5-36) and the result in (5-40), we obtain an expression for the minimum normalized error $V_{\min} = V_\infty$:

$$V_{\min} = \frac{e^{c_0}}{R_0}, \tag{5-41}$$

where $c_0$ is the zeroth coefficient of the signal cepstrum, and $R_0$ is the energy in the signal.

Equation (5-41) can also be derived from the results of Section 5.2. From (5-29), (5-7) and (5-27) we have:

$$V_{\min} = \frac{b_0^2}{R_0} = \frac{e^{c_0}}{R_0}.$$

Also, since the impulse response $\hat{s}_n$ corresponding to $\hat{S}(z)$ is minimum-phase, and $\hat{s}_0 = A_p$ from (3-28), we have:

$$V_p = \frac{E_p}{R_0} = \frac{A_p^2}{R_0} = \frac{\hat{s}_0^2}{R_0}.$$

n  *  * 
It is instructive to write (5-41) as a function of $P(\omega)$:

$$V_{\min} = \frac{\exp\left[\dfrac{T}{2\pi}\displaystyle\int_{-\pi/T}^{\pi/T} \log P(\omega)\, d\omega\right]}{\dfrac{T}{2\pi}\displaystyle\int_{-\pi/T}^{\pi/T} P(\omega)\, d\omega}. \tag{5-42}$$

It is clear from (5-42) that $V_{\min}$ depends completely on the shape of the signal spectrum. Similarly, from (5-40), $V_p$ depends completely on the shape of the approximate spectrum. This fact is very important in interpreting the behavior of the normalized error curve for the spectra of different sounds. For example, in Fig. 5-2 the error curve for the unvoiced fricative [s] is much higher than that for the vowel [æ]. On the whole, unvoiced sounds have a high error curve while voiced sounds have a much lower error curve. This property of voiced versus unvoiced sounds has been observed before (Atal and Hanauer, 1971; Markel, SCRL Mon. 1971), and $V_p$ has been suggested as a possible parameter for the detection of voicing. However, with our result showing that the error curves are dependent only on the shape of the spectrum, it is clear that what makes this apparent dichotomy between voiced and unvoiced sounds has nothing to do with the fact of voicing itself, but rather with the shapes of the spectra corresponding to these sounds.




By examining the behavior of $V_{\min}$ in (5-42) one gains insight into how the error curves change for different shapes of the spectrum. For example, it is easy to show that if the spectrum is perfectly flat, then $V_{\min} = 1$, and the error curve is the highest possible. On the other hand, if all the energy is concentrated in certain regions of the spectrum and the rest of the spectrum contains zero energy, then $V_{\min} = 0$, and the error curve is the lowest possible. Speech sounds lie somewhere between these two extremes. In general, voiced sounds (especially sonorants) have most of the energy concentrated in one region at low frequencies, resulting in low error curves. Unvoiced sounds, on the other hand, have the energy more evenly distributed across the spectrum, resulting in higher error curves. However, this property cannot be relied upon all the time. As an example, Fig. 5-5a shows the error curve for the burst [k] in the word "concentration". The error curve is low although the sound is unvoiced. In this case, this was due to the fact that the [k] spectrum had a single sharp peak where most of the energy was concentrated (see Fig. 5-5b).

Fig. 5-5. (a) Normalized error curve for the [k] burst in the word "concentration". (b) Burst spectrum.

An interesting way to look at $V_{\min}$ in (5-42) is to view it as the ratio of the geometric mean to the arithmetic mean of the spectrum, where the notions of the geometric and arithmetic means have been extended to the continuous case. This becomes clear if one assumes that the spectrum $P(\omega)$ is approximated by a staircase spectrum with $N$ distinct values $P_k$ over the frequency range $-\pi/T$ to $\pi/T$. In that case, (5-42) reduces to:

$$V_{\min} = \frac{\left(\prod_{k=1}^{N} P_k\right)^{1/N}}{\dfrac{1}{N}\sum_{k=1}^{N} P_k}, \tag{5-43}$$

which is the ratio of a geometric mean to an arithmetic mean.

Such a ratio has been useful in acoustic signal processing in getting bounds on the difference between averaging logarithms versus taking the logarithm of the average of measured data samples (Cox, 1966; Hershey, 1972). (This difference is simply the logarithm of $V_{\min}$ in our case.) It is well known that the ratio in (5-43) is equal to 1 if all the data are equal, and the value decreases as the spread of the data increases. A larger spread is equivalent to heavy concentrations in certain regions and a simultaneous lack of energy in the other regions of the spectrum, i.e. the spectrum has a large dynamic range.
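The ratio of geometric to arithmetic mean is straightforward to evaluate from samples of the power spectrum. The following sketch does this for a finite sequence by sampling $|S(\omega)|^2$ at $M$ equispaced frequencies with a direct DFT; it assumes that no spectral sample is exactly zero, and the function name, $M = 64$ and the two test sequences are hypothetical.

```python
import cmath
import math


def mean_ratio(s, m=64):
    """Geometric mean over arithmetic mean of |S|^2 at m equispaced frequencies."""
    power = []
    for k in range(m):
        sk = sum(x * cmath.exp(-2j * math.pi * k * n / m)
                 for n, x in enumerate(s))
        power.append(abs(sk) ** 2)
    geometric = math.exp(sum(math.log(p) for p in power) / m)
    arithmetic = sum(power) / m
    return geometric / arithmetic


flat = mean_ratio([1.0])                      # unit sample: perfectly flat spectrum
shaped = mean_ratio([1.0, 0.5, -6.5, 3.0])    # hypothetical non-flat spectrum
print(flat, shaped)
```

The flat spectrum gives a ratio of 1, and any departure from flatness lowers the ratio.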

In order to get a better feel for how $V_{\min}$ varies with different spectral shapes, we shall compute the ratio in (5-42) for three models of the spectrum: (a) a two-level model, (b) a single-pole model, and (c) a double-pole model. Below, we shall refer to the ratio in (5-42) simply as $V$; it is the ratio of the geometric mean of a function to its arithmetic mean.
A. Two-Level Model

The two-level model is shown in Fig. 5-6a. The spectrum consists of two levels: a high level labeled $H$, and a low level labeled $L$. In Fig. 5-6a, for $y = 0$ or $y = 1$, the spectrum is flat and from (5-42), $V = 1$. For $0<y<1$, $0<V<1$. Therefore, for fixed $H$ and $L$, there must exist some $y_m$ for which $V$ has a minimum value $V_m$. We shall find this value of $V_m$ as a function of $H$ and $L$. From (5-42) and Fig. 5-6a:

$$V = \frac{\exp\left[y \log H + (1-y)\log L\right]}{yH + (1-y)L}. \tag{5-44}$$

The value $y_m$ of $y$ that minimizes $V$ can be shown to be equal to:

$$y_m = \frac{1}{\log d} - \frac{1}{d-1}, \tag{5-45}$$

where

$$d = \frac{H}{L} \tag{5-46}$$

will be known as the dynamic range. Substituting (5-45) and (5-46) in (5-44) we obtain $V_m$, the lower bound on $V$:

$$V_m = \gamma\, e^{(1-\gamma)}, \tag{5-47}$$

$$\text{where}\quad \gamma = \frac{\log d}{d-1}. \tag{5-48}$$

(5-47) is the expression for the lower bound on $V$ for a particular value of the dynamic range $d$. Figure 5-6b shows a plot of $V_m$ versus the dynamic range $D$ in dB, where

$$D = 10 \log_{10} d = 10 \log_{10} \frac{H}{L}. \tag{5-49}$$

Fig. 5-6. (a) Two-level spectral model. (b) Plot of $V_m$, the lower bound on the ratio $V$, versus the dynamic range of the spectrum, $D = 10 \log_{10}(H/L)$ (dB).

For example, for a dynamic range $D = 20$ dB we read from Fig. 5-6b that the lower bound on the value of $V$ for any two-level spectrum with dynamic range of 20 dB is 0.12.
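The lower bound for the two-level model can be reproduced numerically. The sketch below assumes a two-level spectrum with $L = 1$ and $H = d$, with a fraction $y$ of the band at the high level, evaluates the closed-form minimum, and compares it with a brute-force search over $y$; the function and variable names are hypothetical.

```python
import math


def two_level_ratio(y, d):
    """Ratio V for levels H = d and L = 1 with a fraction y of the band at H."""
    return d ** y / (y * d + (1.0 - y))


def lower_bound(d):
    """Closed-form minimizing fraction y_m and minimum V_m for dynamic range d."""
    y_m = 1.0 / math.log(d) - 1.0 / (d - 1.0)
    gamma = math.log(d) / (d - 1.0)
    return y_m, gamma * math.exp(1.0 - gamma)


d = 10.0 ** (20.0 / 10.0)                 # dynamic range D = 20 dB
y_m, v_m = lower_bound(d)
grid_min = min(two_level_ratio(i / 10000.0, d) for i in range(10001))
print(y_m, v_m, grid_min)
```

For $D = 20$ dB the closed form and the search both give a value close to the 0.12 read from the figure.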

In terms of the value of $V$, it is clear from the properties of the integrals in (5-42) that the two-level model in Fig. 5-6a also applies to any other spectral shape that has only two values (levels). The importance of the two-level model to other multi-valued spectral shapes is in providing a lower bound on $V$ for all other shapes. This is made explicit by the following lemma and theorem.

Lemma: Let $H$ and $L$ be the highest and lowest spectral values for any spectrum with total energy $R_0$. There exists a unique two-level spectrum, such as shown by Fig. 5-6a, whose two levels are $H$ and $L$ and whose total energy is equal to $R_0$. This two-level spectrum has:

$$y = \frac{R_0 - L}{H - L}. \tag{5-50}$$

Theorem 1: For a given $H$, $L$ and $R_0$, the value of $V$ for the two-level spectrum determined by (5-50) is a lower bound on the value of $V$ for any spectrum with maximum and minimum values $H$ and $L$, and total energy $R_0$.

The derivation of (5-50) is straightforward. However, the proof of the theorem is more involved and will not be given here. The method of proof is to make a perturbation to the shape in Fig. 5-6a, keeping $R_0$ constant, and proving that the resultant spectrum has a higher $V$ than that for the original two-level spectrum.

Another way to state Theorem 1 is to say that for a certain dynamic range $D$ and energy $R_0$, the two-level spectrum gives the minimum value for $V$. Moreover, we have seen that for a particular dynamic range $D$ there is a particular two-level spectrum determined by (5-45) which gives a value $V_m$ that is a lower bound for all two-level spectra with dynamic range $D$. This leads us to the following theorem:

Theorem 2: The value given by $V_m$ in (5-47) is an absolute lower bound on the value of $V$ for any spectrum with a given dynamic range $D$.

By equating the value of $y$ in (5-50) and (5-45) one can solve for $R_0$, resulting in the following corollary:

Corollary: A spectrum with maximum and minimum values, $H$ and $L$, and total energy $R_0$ given by

$$R_0 = \frac{H-L}{\log \dfrac{H}{L}} \tag{5-51}$$

has an equivalent two-level spectrum as determined by (5-50) whose value for $V$ is given by $V_m$ in (5-47).


How close the value of $V$ for a particular spectrum comes to $V_m$ depends on how well that spectrum can be approximated by a two-level spectrum and how close $R_0$ is to the value given by (5-51). As an example of the latter condition, if the dynamic range $D = 20$ to 30 dB, then the total energy $R_0$ must be approximately 7-8 dB below $H$ for (5-51) to apply. For actual speech spectra, if $H$ and $L$ are those of the spectral envelope then the general shape of the curve in Fig. 5-6b applies, though the actual values of $V$ are usually higher than those in the figure. As a general statement one can say that the value of the normalized error decreases as the spectral dynamic range increases.

B. Single and Double-Pole Models

The two-level model concentrated on the effect of the spectral dynamic range on the value of $V$. Here we shall examine the effect of the general slope of the spectrum on the value of $V$.

First we shall derive $V$ for an arbitrary pair of poles, then we deal with special cases. Let the two poles be at $z=a$ and $z=b$, both inside the unit circle. The transfer function for the two poles can be represented by

$$X(z) = \frac{1}{(1-az^{-1})(1-bz^{-1})}, \quad |a|<1,\ |b|<1. \tag{5-52}$$
In computing $V$ from (5-42) we also need the numerator, which is equal to $e^{c_0}$. Following a derivation similar to that in Section 5.2 (equations 5-36 to 5-39) we conclude that $c_0 = 0$ for $X(z)$, and hence $e^{c_0} = 1$. Therefore,

$$V = \frac{1}{R_0} = \frac{(1-ab)(1-a^2)(1-b^2)}{1+ab}. \tag{5-55}$$


(5-55) is true for any pair of poles inside the unit circle.

Complex-Conjugate Pair of Poles: Here $b$ is the complex conjugate of $a$, $b = \bar{a}$. Therefore,

$$V = \frac{1-r^2}{1+r^2}\left[1 + r^4 - 2r^2\cos(2\omega T)\right], \tag{5-56}$$

where $r = |a| = |b|$ and $\omega T$ = angular position of $a$ or $b$.

Double Real Poles: $a = b$, both real.

$$V = \frac{(1-b^2)^3}{1+b^2}. \tag{5-57}$$

Single Real Pole: $a = 0$, $b$ is real.

$$V = 1-b^2. \tag{5-58}$$

Note that $b$ could be either positive or negative. Recall that a positive real pole corresponds to the usual real pole in the analog domain, while a negative real pole in the z-plane behaves like a pair of complex conjugate poles at half the sampling frequency (see Appendix A). The spectrum of a negative real pole is just the mirror image of the spectrum of a positive real pole. While the spectrum of a positive real pole slopes down at approximately 6 dB/octave, that of a negative real pole slopes up at the same rate. Note, however, that the value of $V$ in (5-58) is the same whether the pole is positive or negative. The same goes for the double real poles in (5-57).
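The closed forms for the pole models can be checked against a direct computation. The sketch below assumes, as in the text, that the numerator $e^{c_0}$ equals 1, so that $V = 1/R_0$, and obtains the energy $R_0$ by summing the squared impulse response of $X(z) = 1/((1-az^{-1})(1-bz^{-1}))$. The function names, the truncation length and the pole values are hypothetical.

```python
import cmath


def ratio_from_impulse_response(a, b, terms=2000):
    """V = 1 / R_0, with R_0 the energy of the impulse response of X(z)."""
    c1, c2 = a + b, -(a * b)      # x_n = c1 x_{n-1} + c2 x_{n-2} + delta_n
    prev2, prev1 = 0.0, 0.0
    energy = 0.0
    for n in range(terms):
        x = (1.0 if n == 0 else 0.0) + c1 * prev1 + c2 * prev2
        energy += abs(x) ** 2
        prev2, prev1 = prev1, x
    return 1.0 / energy


def ratio_closed_form(a, b):
    """General two-pole expression (1-ab)(1-a^2)(1-b^2)/(1+ab)."""
    return ((1 - a * b) * (1 - a * a) * (1 - b * b) / (1 + a * b)).real


single = (0.0, -0.6)                               # single negative real pole
double = (0.7, 0.7)                                # double real pole
r, theta = 0.8, 0.6
pair = (r * cmath.exp(1j * theta), r * cmath.exp(-1j * theta))

for a, b in (single, double, pair):
    print(ratio_from_impulse_response(a, b), ratio_closed_form(a, b))
```

Each printed pair should agree, and the three cases reduce to $1-b^2$, $(1-b^2)^3/(1+b^2)$ and the complex-conjugate form, respectively.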

Using linear prediction we approximated the spectra of several sounds from a single male speaker by single and two-pole spectra. The speech signal was low-pass filtered at 4.5 kHz and sampled at 10 kHz. The results showed that most sonorants were well approximated by a complex pair of poles with a Q (ratio of frequency to bandwidth of resonance) of between .5 and 2. The frequency of the resonance ranged from about 200 to 700 Hz for different sonorants. [t] bursts were approximated by a complex pair of poles at around ? kHz with a Q of 1.5. (Most of the high frequency energy in the burst had been filtered out.) The fricative [š] was also modeled by a complex pair at about 2700 Hz with a Q of 2. On the other hand, the fricative [s] was approximated by two real poles: one negative and one positive. When the approximation was restricted to a single pole, the pole was negative and positioned around the real frequency 1000 Hz (i.e. the pole is at 5 kHz with a half bandwidth of 1000 Hz).
The values of $V$ in (5-56) for complex pairs of poles with low Q are quite close to that for a double pole in (5-57) at the same frequency. Therefore we shall give the values of $V$ in (5-57) for different frequencies. This is shown as a graph in Fig. 5-7. For sonorants, values of $V$ are seen to range from about .01 to .1. Also shown in Fig. 5-7 is a graph of $V$ in (5-58) for a single real pole (positive or negative) with real frequency as the abscissa. The value for [s] would be on that graph around 1000 Hz. In order to convert between real frequency and the value of $b$ in (5-57) and (5-58) use the formula

$$b = e^{-\omega T} = e^{-2\pi f T},$$

where $f$ is the real frequency and $T$ is the sampling interval (in this case $T = 100$ μsec).

These  graphs  have  two  main  properties.  First,  at  any  one 
frequency,  V  is  less  for  a  double  pole  than  for  a  single  pole. 
This  is  to  be  expected  since  the  spectrum  of  the  double  pole  has 
a  larger  dynamic  range  than  that  of  the  single  pole,  and  we  have 
learned  that,  other  things  being  equal,  a  larger  dynamic  range 
results  in  a  lower  V.  Second,  for  each  of  the  two  curves,  as  the 
frequency  of  the  pole  increases,  V  increases.  Again,  this  is  to 
be  expected  since  as  the  pole  frequency  increases  the  dynamic 
range  of  the  corresponding  spectrum  decreases  which  causes  an 
increase  in  V. 
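Both properties can be seen from a few computed values. The sketch below assumes the conversion $b = e^{-2\pi f T}$ with $T = 100$ μsec, together with the single-pole expression $1-b^2$ and the double-pole expression $(1-b^2)^3/(1+b^2)$; the function names and the list of frequencies are hypothetical.

```python
import math

T = 100e-6  # sampling interval for a 10 kHz sampling rate


def pole_from_frequency(f):
    """Real pole position b for real frequency f in Hz (assumed conversion)."""
    return math.exp(-2.0 * math.pi * f * T)


def v_single(f):
    b = pole_from_frequency(f)
    return 1.0 - b * b                             # single real pole


def v_double(f):
    b = pole_from_frequency(f)
    return (1.0 - b * b) ** 3 / (1.0 + b * b)      # double real pole


freqs = [200.0, 500.0, 700.0, 1000.0, 2000.0]
table = [(f, v_double(f), v_single(f)) for f in freqs]
for f, vd, vs in table:
    print(f"{f:6.0f} Hz  double: {vd:.4f}  single: {vs:.4f}")
```

At every frequency the double-pole value is below the single-pole value, and both columns increase with frequency.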


Fig. 5-7. The ratio V for single ... of the spectrum. (Abscissa: frequency, fs = 10 kHz.)
This concludes our exposition of the behavior of $V_{\min}$ as a function of different spectral models. For real speech spectra $V_{\min}$ must be computed from (5-42) in an approximate manner. This is discussed in the next section where we deduce properties of the zeroth quefrency $c_0$.


5.4 The Zeroth Quefrency

It is clear from (5-41) that $V = e^{c_0}/R_0$ depends completely on the zeroth quefrency $c_0$ and the total energy $R_0$ of the signal. Therefore, all the properties of $V$ that were discussed in Section 5.3 are actually a reflection of the properties of $c_0$. We shall not repeat these properties here, but we would like to examine another possible usefulness of $c_0$ in speech analysis.


Given two signals such that one is a constant multiple of the
other, their cepstra are identical except at the origin (i.e. at
$c_0$). This property led Mersereau and Oppenheim (1972) to suggest
the possibility of using $c_0$ as a measure of signal amplitude. They
presented plots of $c_0$ for several utterances and compared them with
plots of $\log R_0$. They noticed that the two curves had similar gross
features except that for some fricatives $c_0$ had definite peaks
while $\log R_0$ did not. These differences between $c_0$ and $\log R_0$ can
be easily explained from the properties of V. Indeed, the difference
between $c_0$ and $\log R_0$ is simply given by

$$ \log V = c_0 - \log R_0 . \tag{5-59} $$

This difference can be measured in dB if we take $10 \log_{10} V$, in
which case:

$$ 10 \log_{10} V = 4.34\, c_0 - 10 \log_{10} R_0 $$

or

$$ V(\mathrm{dB}) = c_0(\mathrm{dB}) - R_0(\mathrm{dB}) , \tag{5-60} $$

where V(dB) is V measured in dB, $R_0$(dB) is the energy measured in
dB, and $c_0(\mathrm{dB}) = 4.34\, c_0$. Since $V \le 1$ must always hold, $\log V$ is
always negative (or equal to zero). Therefore,

$$ c_0(\mathrm{dB}) \le R_0(\mathrm{dB}) . \tag{5-61} $$

How much $c_0$(dB) is less than $R_0$(dB) depends on the shape of the
spectrum. From the analysis in Section 5.3 it is clear that
$c_0$(dB) could be as much as 20 dB less than $R_0$(dB) for certain
sonorants. On the other hand, for some fricatives that difference
could be as low as 3 or 4 dB. This is why, relative to the general
trend of $c_0$ versus $\log R_0$, some fricatives were marked by
sharp peaks. From our experience, even within the sonorants
themselves V(dB) varied by as much as 10 dB.

Our conclusion is that the zeroth quefrency $c_0$ indeed does
carry information concerning the energy in the signal, but that
information is coupled with other information about the general
shape of the signal spectrum. The energy information can be
factored out by dividing $e^{c_0}$ by $R_0$, leaving the information on the
spectral shape, and that is simply V. $c_0$ (more accurately $e^{c_0}$) is
a measure of the geometric mean of the spectrum, while $R_0$ is a
measure of the arithmetic mean. Thus, the information that $c_0$
carries about $R_0$ is the same information that a geometric mean
carries about the corresponding arithmetic mean, no less and no
more. The relation between the two means is represented by V.
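
The geometric-mean and arithmetic-mean reading of V can be illustrated numerically. The sketch below assumes a power spectrum sampled at equally spaced frequencies, so that the integrals defining $c_0$ and $R_0$ become simple averages; the function name and the two example spectra are made up for illustration.

```python
import math

def zeroth_quefrency_and_v(power_spectrum):
    """Return (c0, R0, V) for a uniformly sampled, strictly positive power spectrum.

    c0 is the average of the log spectrum, R0 the average of the spectrum,
    and V = exp(c0) / R0 is the geometric mean divided by the arithmetic mean.
    """
    n = len(power_spectrum)
    c0 = sum(math.log(p) for p in power_spectrum) / n
    r0 = sum(power_spectrum) / n
    return c0, r0, math.exp(c0) / r0

flat = [2.0] * 8                                           # small dynamic range
peaky = [100.0, 10.0, 1.0, 0.1, 0.1, 1.0, 10.0, 100.0]     # large dynamic range

results = {}
for name, spectrum in (("flat", flat), ("peaky", peaky)):
    c0, r0, v = zeroth_quefrency_and_v(spectrum)
    v_db = 10.0 * math.log10(v)
    c0_db = (10.0 / math.log(10.0)) * c0                   # the 4.34 factor in (5-60)
    r0_db = 10.0 * math.log10(r0)
    results[name] = (v, v_db, c0_db - r0_db)
    print(name, round(v, 4), round(v_db, 2), round(c0_db - r0_db, 2))
```

The flat spectrum gives V = 1 (0 dB), while the spectrum with the larger dynamic range gives a much smaller V, and in both cases V(dB) equals $c_0$(dB) minus $R_0$(dB).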

Computational  Considerations 

If $c_0$ is to be computed for a speech signal using a digital
computer, then the integral of the log spectrum must be approximated
by a summation. This is usually no problem, unless one of
the spectral values happens to be zero. This is most likely to
happen at d.c., especially since many people remove the d.c. component
from the signal before computing the spectrum. The problem,
of course, is that the logarithm of zero is normally considered
to be $-\infty$. Anything added to $-\infty$ keeps the sum at $-\infty$, and
$c_0$ will have the value $-\infty$. This result is incorrect since we know
that the integral of the log spectrum for any signal with finite
non-zero energy must always be finite. The fact that the spectrum
$P(\omega)$ is zero at one point (causing $\log P(\omega) \to -\infty$) does not mean that
the integral of $\log P(\omega)$ is also infinite. As a simple illustration,
it can be verified that

$$ \int_0^{\epsilon} \log \omega \, d\omega = \epsilon \log \epsilon - \epsilon . \tag{5-62} $$

Note that at $\omega = 0$, $\log \omega = -\infty$, but the integral in (5-62) is finite
for an arbitrarily small $\epsilon$. In particular, as $\epsilon \to 0$, the integral
approaches zero, and thus the fact that the logarithm is infinite
at $\omega = 0$ did not contribute to the integral at that point.

It should be clear that the above problem in computing $c_0$
arose only because we are approximating the integration by a
summation. Indeed, if the integral in (5-62) is to be approximated
by a summation and the value at $\omega = 0$ is used, the same problem
would occur. If we assume that this problem is likely to
arise only at d.c., then a good solution is to remove the d.c.
from the signal and then ignore the spectrum at d.c. in computing
$c_0$.
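
As a rough numerical check of this argument, the sketch below does two things: it approximates the integral in (5-62) by a midpoint sum that never evaluates the logarithm at $\omega = 0$, and it computes $c_0$ from a sampled spectrum while skipping bins with zero power. Both helper names are hypothetical, and skipping every zero bin (not only the d.c. bin) is a simplifying assumption.

```python
import math

def midpoint_log_integral(eps, steps=20000):
    """Midpoint-rule approximation of the integral of log(w) from 0 to eps."""
    h = eps / steps
    return h * sum(math.log((i + 0.5) * h) for i in range(steps))

def c0_ignoring_zero_bins(power_spectrum):
    """Average log spectrum over bins with non-zero power (e.g. a removed d.c. bin)."""
    logs = [math.log(p) for p in power_spectrum if p > 0.0]
    return sum(logs) / len(logs)

eps = 0.5
exact = eps * math.log(eps) - eps          # right-hand side of (5-62)
approx = midpoint_log_integral(eps)
print(round(exact, 4), round(approx, 4))

# A spectrum whose d.c. bin is exactly zero still yields a finite c0.
spectrum = [0.0, 1.0, math.e, math.e ** 2]
c0 = c0_ignoring_zero_bins(spectrum)
print(round(c0, 6))
```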

5.5  Detection  of  Voicing 

In Section 5.3 we pointed out the possible usefulness of
the normalized error $V_p$ as a voicing detector. This could be
implemented by setting a threshold on the normalized error for
a particular value of p. If $V_p$ is less than the threshold, the
sound is judged to be voiced; otherwise it is judged to be unvoiced.
For speech recorded in a quiet room using a high quality
system, we have found that the normalized error can be used in
this manner a large portion of the time for the detection of
voicing. (More precisely, it is useful in differentiating sonorants
from nonsonorants. In the cases of stops and fricatives,
the normalized error does not work particularly well as a
voicing detector.) It should be reiterated that this behavior
of the normalized error has nothing to do with the fact of
voicing itself, but rather with the shapes of the spectra corresponding
to voiced vs. unvoiced (or sonorant vs. nonsonorant)
sounds. We will point out some of the common conditions under
which the normalized error works less than ideally as a voicing
detector.
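
A minimal sketch of such a threshold test follows. It assumes the Autocorrelation method, computes $V_p = E_p/R_0$ with the standard Levinson-Durbin recursion, and uses an arbitrary threshold of 0.3 and p = 12; neither value comes from this report, and the two test signals are synthetic stand-ins rather than speech.

```python
import math
import random

def autocorrelation(signal, max_lag):
    """R_k = sum over n of s_n * s_{n+k}, for k = 0..max_lag."""
    n = len(signal)
    return [sum(signal[i] * signal[i + k] for i in range(n - k))
            for k in range(max_lag + 1)]

def normalized_error(R, p):
    """Return V_p = E_p / R_0 via the Levinson-Durbin recursion."""
    a = [0.0] * (p + 1)            # a[1..i] hold the predictor coefficients
    error = R[0]
    for i in range(1, p + 1):
        acc = R[i] + sum(a[j] * R[i - j] for j in range(1, i))
        k = -acc / error           # reflection coefficient at stage i
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] + k * a[i - j]
        a = updated
        error *= 1.0 - k * k       # E_i = (1 - k_i^2) E_{i-1}
    return error / R[0]

def is_voiced(signal, p=12, threshold=0.3):
    """Judge a frame voiced when V_p falls below the (assumed) threshold."""
    return normalized_error(autocorrelation(signal, p), p) < threshold

rng = random.Random(1)
# Low-frequency energy plus a small noise floor: a crude sonorant-like frame.
vowel_like = [math.cos(0.2 * n) + 0.5 * math.cos(0.5 * n) + 0.05 * rng.uniform(-1.0, 1.0)
              for n in range(400)]
# Flat-spectrum noise: a crude fricative-like frame.
noise_like = [rng.uniform(-1.0, 1.0) for _ in range(400)]
print(is_voiced(vowel_like), is_voiced(noise_like))
```

In practice the threshold would have to be tuned to the recording conditions, for the reasons discussed in the remainder of this section.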

Background Noise - During stop gaps and other periods of silence,
the signal being analyzed is the background noise. During these
periods, irrespective of how low the noise level is, the normalized
error curve could be low or high, depending on the shape of
the noise spectrum. If the noise spectrum is rather flat, the
error curve will be high and the spectrum will be judged to be
unvoiced. However, in many real life situations there is a
heavy energy concentration at very low frequencies, which causes
the error curve to be low and may cause the spectrum to be
judged as voiced. A possible solution is to high-pass the speech
signal to get rid of these low frequency components (which are
usually below 250 Hz), but this filtering would also filter out
the low frequency components in all other sounds to an undesirable
extent. A better solution would be to detect periods of silence
from energy considerations (e.g. $R_0$) and then avoid making a
voicing decision based on V during these periods.

... extensive vocal communication which distorts the speech signal
in many ways. For example, the energy below 300 Hz and above 3
kHz is filtered out. This keeps much of the formant structure
relatively  untouched,  but  it  filters  out  much  of  the  energy  for 
sonorants  and  fricatives.  This,  in  addition  to  other  important 
factors  (such  as  noise) ,  reduces  the  spectral  dynamic  range  of 
the  signal.  The  overall  effect  on  the  normalized  error  is  that 
it  becomes  higher.  For  some  vowels  the  normalized  error  can  be 
as  much  as  an  order  of  magnitude  higher.  The  result,  of  course, 
is  that  it  becomes  more  difficult  to  use  the  normalized  error 
to  differentiate  between  voiced  and  unvoiced  sounds. 

Effects of Preemphasis - Preemphasis is often used in speech analysis
to compensate for the spectral slope of voiced sounds, which
falls at 6 dB/octave or more. In the digital domain, preemphasis
is conveniently accomplished by differencing the signal (i.e.
subtracting adjacent samples). We shall go into some detail on the
properties of differencing and its effects on the normalized error.
Some of these properties will be useful in the next chapter on
formant extraction.

Let the first difference of the signal $s_n$ be defined by:

$$ s'_n = d(s_n) = s_n - s_{n-1} , \tag{5-63} $$

where $s'_n$ is the differenced signal and $d^i$ is an operator that
takes the ith difference of its argument.

Taking the z-transform of (5-63) we obtain:

$$ S'(z) = (1 - z^{-1})\, S(z) = D(z)\, S(z) , \tag{5-64} $$

where

$$ D(z) = 1 - z^{-1} \tag{5-65} $$

is the differencing operator in the frequency domain. It introduces
a digital zero at z=1, which corresponds to zero frequency.
The power spectrum of the differenced signal is:

$$ P'(\omega) = |S'(\omega)|^2 = |D(\omega)|^2 P(\omega) = |1 - e^{-j\omega T}|^2 P(\omega) = 4 \sin^2\!\left(\frac{\omega T}{2}\right) P(\omega) , \tag{5-66} $$

where

$$ |D(\omega)| = 2 \sin\!\left(\frac{\omega T}{2}\right) \tag{5-67} $$

is the magnitude of the frequency response of the differencing
operator. Therefore, the effect of differencing in the time domain
is to multiply the power spectrum by $4 \sin^2(\omega T/2)$, which is
the spectral response of the zero z=1. Figure 5-8 shows a plot
of $|D(\omega)|$ in (5-67) versus $\omega T$. ($\omega T = \pi$ corresponds to half the
sampling frequency, which would be 5 kHz for a 10 kHz sampled
signal.) Also shown in Fig. 5-8 is a plot of the transfer function
for the analog zero at zero frequency. The analog zero
corresponds to differentiation in the continuous time domain.

Fig. 5-8. The frequency response of a digital zero at z=1
as compared to the corresponding analog zero
at zero frequency. (Abscissa: $\omega T$.)


The response of the analog zero climbs at 6 dB/octave for all
frequencies. The response of the digital zero climbs at 6 dB/octave
at low frequencies, but becomes flat at $\omega T = \pi$. Between
$\omega T = \pi/2$ and $\omega T = \pi$ (which corresponds to the octave 2.5 kHz to
5 kHz) there is a rise of only 3 dB. At $\omega T = \pi$, the digital response
is 3.92 dB lower than the analog response.
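
These figures can be checked directly from (5-67). The sketch below assumes the analog zero is normalized so that its gain equals $\omega T$, which makes the two responses coincide at low frequencies; the function names are illustrative only.

```python
import math

def digital_zero_gain(wT):
    """|D(w)| = 2 sin(wT/2), the magnitude response of 1 - z^-1."""
    return 2.0 * math.sin(wT / 2.0)

def analog_zero_gain(wT):
    """Differentiator response, normalized (by assumption) to equal wT."""
    return wT

def db(x):
    return 20.0 * math.log10(x)

# Rise over a low-frequency octave: close to 6 dB.
low_octave_rise = db(digital_zero_gain(0.02)) - db(digital_zero_gain(0.01))
# Rise over the top octave, wT = pi/2 to pi: only about 3 dB.
top_octave_rise = db(digital_zero_gain(math.pi)) - db(digital_zero_gain(math.pi / 2.0))
# Shortfall of the digital zero relative to the analog zero at wT = pi.
gap_at_pi = db(analog_zero_gain(math.pi)) - db(digital_zero_gain(math.pi))

print(round(low_octave_rise, 2), round(top_octave_rise, 2), round(gap_at_pi, 2))
```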

Therefore, differencing greatly attenuates the energy at
very low frequencies and enhances the energy at high frequencies.
These major effects on the shape of the spectrum have strong
effects on the normalized error curves. As an example, Fig. 5-9
shows the error curves for the same two signals shown in Fig. 5-2
except that in this case the signals were preemphasized by
differencing. The error curve for the unvoiced fricative [s]
became lower while that for the vowel [a] became much higher, so
much so that the [a] curve starts higher than the [s] curve, but
as $p \to \infty$, $V_{\min}$ for [a] becomes lower than $V_{\min}$ for [s]. (This means
that the two curves must have crossed at some point. In this
case the curves cross at p = 122.) In general, preemphasis causes
a marked increase in the value of the normalized error for sonorants.
The effects of preemphasis on unvoiced sounds such as
stop bursts and fricatives are less predictable; the normalized
error could go either up or down depending on the particular
spectrum. These effects can be understood better by examining
how the autocorrelation coefficient $r_1$ is affected by differencing
the signal, and then using Fig. 5-4 to make statements about the
behavior of $V_1$, which, as we have argued before, is a good indication
of the general level of the error curve.

As we pointed out in Section 5.3, $R_1$ is the result of a cosine
weighting on the spectrum which weights low frequencies positively
and high frequencies negatively. Since preemphasis attenuates
low frequencies and emphasizes high frequencies, the effect
is to lower the value of $R_1$ relative to $R_0$, i.e. to lower $r_1$.
From Fig. 5-4, decreasing $r_1$ could either increase or decrease $V_1$,
depending on the value of $r_1$ and how much it decreases. Most
sonorants have $r_1 > .9$, and differencing causes a decrease of between
.1 and .7 so that the resulting $r'_1$ is still greater than zero.
From Fig. 5-4 we see that $V_1$ always increases for this case. For
sounds such as [s] where $r_1 < 0$, decreasing $r_1$ decreases $V_1$. However,
for other unvoiced sounds where $0 < r_1 < .5$, decreasing $r_1$ could
either increase or decrease $V_1$ depending on how much $r_1$ is decreased.
The general impression that one gets upon monitoring the
normalized error is that preemphasis by differencing makes the
normalized error an unreliable measure of voicing.

Computing the values of the autocorrelation function for the
differenced signal (e.g. in order to see the effect of differencing
on $r_1$) is possible from the autocorrelation of the undifferenced
signal. Let $R'_k$ be the autocorrelation function of the differenced
signal. Then, by definition:

$$ R'_k = \sum_n s'_n \, s'_{n+k} . \tag{5-68} $$

Substituting (5-63) in (5-68) we obtain:

$$ \begin{aligned}
R'_k &= \sum_n (s_n - s_{n-1})(s_{n+k} - s_{n+k-1}) \\
     &= \sum_n (s_n s_{n+k} - s_n s_{n+k-1} - s_{n-1} s_{n+k} + s_{n-1} s_{n+k-1}) \\
     &= R_k - R_{k-1} - R_{k+1} + R_k
\end{aligned} $$

and

$$ R'_k = 2R_k - R_{k-1} - R_{k+1} , \tag{5-69} $$

or

$$ R'_k = -[(R_{k+1} - R_k) - (R_k - R_{k-1})] = -[d(R_{k+1}) - d(R_k)] $$

and

$$ R'_k = -d^2(R_{k+1}) . \tag{5-70} $$

(5-70) says that the autocorrelation of a differenced signal is
equal to the negative of the second difference of the autocorrelation
of the original signal. This result is analogous to the
analog domain property that the autocorrelation of a differentiated
continuous signal is equal to the negative of the second
derivative of the autocorrelation of the original signal (see, for
example, Papoulis, 1965, p. 317). As an example, $r'_1$ for the
differenced signal is equal to:

$$ r'_1 = \frac{R'_1}{R'_0} = \frac{2R_1 - R_0 - R_2}{2(R_0 - R_1)} . $$

(Remember that $R_{-k} = R_k$, for all k.)
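
The relation between the two autocorrelation functions is easy to verify numerically. The sketch below assumes the signal is zero outside the analysis frame, so that the sums in (5-68) run over all n and the identity holds exactly; the sample values are arbitrary.

```python
def autocorr(x, k):
    """R_k = sum over n of x_n * x_{n+k}, with x taken as zero outside the list."""
    k = abs(k)                                   # R_{-k} = R_k
    return sum(x[i] * x[i + k] for i in range(len(x) - k))

def first_difference(s):
    """s'_n = s_n - s_{n-1}, treating samples outside the frame as zero."""
    padded = [0.0] + list(s) + [0.0]
    return [padded[i] - padded[i - 1] for i in range(1, len(padded))]

s = [1.0, 3.0, -2.0, 0.5, 4.0, -1.0]             # arbitrary test frame
d = first_difference(s)

# Check R'_k = 2 R_k - R_{k-1} - R_{k+1} for several lags.
mismatch = max(
    abs(autocorr(d, k) - (2.0 * autocorr(s, k) - autocorr(s, k - 1) - autocorr(s, k + 1)))
    for k in range(5)
)

# r'_1 directly from the differenced signal, and from R_0, R_1, R_2 of the original.
r1_diff_direct = autocorr(d, 1) / autocorr(d, 0)
R0, R1, R2 = autocorr(s, 0), autocorr(s, 1), autocorr(s, 2)
r1_diff_formula = (2.0 * R1 - R0 - R2) / (2.0 * (R0 - R1))
print(mismatch, r1_diff_direct, r1_diff_formula)
```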


5.51 Using $r_1$ as a Voicing Detector

It has become clear that what makes the normalized error a
good voicing detector for high quality speech is the fact that
most voiced sounds have a high energy concentration at low frequencies
while unvoiced sounds have the energy more spread out
or partly concentrated at high frequencies. This spectral balance,
when disturbed (e.g. by preemphasis), causes the normalized
error to be an unreliable voicing detector. We have explained
some of the reasons for this above, where we appealed to an analysis
in terms of $r_1$ and its effect on $V_1$. In particular, we observed
that for a differenced signal $r'_1 < r_1$, while the value for
$V'_1$ had no such consistent relation. This suggests the use of $r_1$
as a voicing detector.

For an unpreprocessed signal, $r_1$ should work as well as the
normalized error. From the limited data we have examined for a
single male speaker, $r_1 > .8$ for voiced sounds and $r_1 < .6$ for unvoiced
sounds worked very well as a voicing detector. Furthermore, when
the speech signal was preemphasized by differencing we noted that
$r'_1$ was always less than $r_1$, but the amount changed with the particular
sound. Front vowels exhibited a large drop as might be expected.
(For example, one [i] sound had $r_1 = .95$ and $r'_1 = .2$.) However,
most sounds remained separable between voiced and unvoiced,
although we do not expect the reliability to be as high as with
$r_1$. If preemphasis is performed before the signal is digitized
then one could just use $r'_1$. However, if the signal is to be differenced
digitally, one need not use $r'_1$; $r_1$ would still be available
and relatively cheap computationally; all that is needed is
to compute $R_0$ and $R_1$ from the original signal before it is differenced.
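
A detector of this kind needs only $R_0$ and $R_1$. The sketch below uses the .8 and .6 values quoted above as default thresholds; the "undecided" band between them and the two sinusoidal test frames are assumptions made for illustration, not part of the reported experiment.

```python
import math

def first_autocorrelation_coefficient(signal):
    """r_1 = R_1 / R_0 computed over one frame."""
    R0 = sum(x * x for x in signal)
    R1 = sum(signal[i] * signal[i + 1] for i in range(len(signal) - 1))
    return R1 / R0

def classify_voicing(signal, voiced_above=0.8, unvoiced_below=0.6):
    """Label a frame from r_1 alone; values between the two thresholds stay undecided."""
    r1 = first_autocorrelation_coefficient(signal)
    if r1 > voiced_above:
        return "voiced"
    if r1 < unvoiced_below:
        return "unvoiced"
    return "undecided"

T = 100e-6                                         # 10 kHz sampling
low_frequency_frame = [math.sin(2 * math.pi * 200.0 * n * T) for n in range(256)]
high_frequency_frame = [math.sin(2 * math.pi * 4000.0 * n * T) for n in range(256)]
print(classify_voicing(low_frequency_frame), classify_voicing(high_frequency_frame))
```

Energy concentrated at 200 Hz gives $r_1$ close to 1, while energy concentrated at 4 kHz gives a negative $r_1$, in line with the cosine weighting described earlier.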

There is nothing sacred or magical about using the normalized
error or $r_1$ as a voicing detector, especially if the signal was
processed in some special way. In that case one could perform a
suitable weighting on the spectrum and get a measure that would
correlate well with voiced or unvoiced sounds. $r_1$ uses a cosine
weighting; this is only one of an infinite number of different
weightings that could be used. Furthermore, no single method for
the detection of voicing will work all the time. It is normally
advisable to have at least two methods at hand, and the two should
be based on different properties of the signal.

5.6  Optimum  Number  of  Predictor  Coefficients 

It was stated in Section 4.3 that for certain applications we
wish to approximate the envelope of the signal spectrum $P(\omega)$ by
an all-pole spectrum $\hat{P}(\omega)$ whose parameters are the predictor
coefficients $a_k$, $1 \le k \le p$. Also, we were assured that by minimizing the
error in (4-16) we obtain a spectrum $\hat{P}(\omega)$ which (for some p) is a
good estimate of the spectral envelope of $P(\omega)$. The question that
remains is for what value(s) of p will $\hat{P}(\omega)$ indeed be a good
spectral envelope. We know that such a value of p (or range of values)
must exist, because for very low values of p, $\hat{P}(\omega)$ is a very crude
fit to $P(\omega)$, while as $p \to \infty$, $\hat{P}(\omega)$ becomes identical to $P(\omega)$.
Somewhere in between there should be a value of p that would be
satisfactory for a good envelope fit. In Section 2.4 we obtained a
rough idea of what p should equal for some sounds from theoretical
considerations. Here we shall give an empirical method to determine
the optimum value of p for each sound.

Figures 5-2, 5-5 and 5-9 show error curves corresponding
to different spectra. Each of the error curves starts at 1 for
p=0 and monotonically decreases to its own $V_{\min}$ as $p \to \infty$. Also,
each of the curves exhibits what might be called the "knee" of
the curve. This is a region of the curve after which the curve
slopes very slowly towards its asymptote. For example, in
Fig. 5-2, starting at p=7 for [s] and at p=11 for [ae], the error
curve falls off gently. Our physical explanation for this "knee"
is that around that value of p the approximate spectrum $\hat{P}(\omega)$ is
the optimum approximation to the envelope of the signal spectrum
$P(\omega)$. A lower value of p results in a grosser approximation to
the spectral envelope while a larger value of p will superimpose
fine structure information on the spectral envelope. This explanation
is based on the properties of the error measure (4-16)
which were discussed in Section 4.3.

Therefore, for each frame of the signal one could find the
knee of the error curve and choose the optimal value of p as that
place where the error curve begins to fall off slowly towards its
asymptote. This method is, of course, quite approximate. It
should be clear that the optimal value of p will vary a good deal
depending on the particular sound. For many applications this
process is cumbersome and a fixed value of p would be more desirable.
In general, increasing p beyond its optimal value has
a less drastic effect than if p is decreased. Therefore it is
usually sufficient to set p to a fixed value that is the upper
limit necessary to describe the spectral envelopes of the different
sounds in the signal. For speech signals bandlimited to
5 kHz and sampled at 10 kHz, a value of p between 10 and 14 is chosen
depending on the application. This agrees with the speech production
considerations of Section 2.4.
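
The text does not give a precise rule for locating the knee, so the following is only one possible heuristic, stated as an assumption: take the smallest p after which every further single-step drop in the normalized error stays below a chosen tolerance. The function name, the tolerance, and the example error curve are all invented for illustration.

```python
def knee_order(error_curve, tolerance=0.02):
    """Return the smallest p after which each step of the curve drops by < tolerance.

    error_curve[p] is the normalized error V_p for p = 0, 1, 2, ...
    This is an illustrative heuristic, not a rule taken from the report.
    """
    last = len(error_curve) - 1
    for p in range(last):
        later_drops = (error_curve[q] - error_curve[q + 1] for q in range(p, last))
        if all(drop < tolerance for drop in later_drops):
            return p
    return last

# A made-up error curve: steep at first, then nearly flat.
curve = [1.0, 0.4, 0.2, 0.12, 0.11, 0.105, 0.103, 0.102]
print(knee_order(curve))
```

For the made-up curve above the heuristic selects p = 3; with real error curves the tolerance would need to be chosen with the application in mind.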

In Section 6.2 the above results will be extended to other
linear prediction methods, and will be useful in determining the
value of p which leads to accurate formant information.

CHAPTER VI
FORMANT ANALYSIS AND PITCH EXTRACTION


In an analysis-synthesis system based on linear prediction,
the synthesis part of the system is normally based on the speech
production model shown in Fig. 2-1. We have discussed in Chapters
III and IV several methods for the computation of the predictor
parameters. In Chapter V we discussed methods for the detection
of voicing. One important remaining parameter is the pitch $\tau$,
for those sounds judged to be voiced. We define pitch to be the
time interval between consecutive glottal pulses. The instantaneous
fundamental frequency $F_0$ is then defined as the inverse of
the pitch, $F_0 = 1/\tau$. The first section in this chapter discusses
briefly methods of pitch extraction (estimation) based on linear
prediction. It should be emphasized that the discussion in this
chapter applies to both the Covariance and Autocorrelation methods
of linear prediction.

For other applications, such as formant-based synthesis and
speech recognition, it is desired to estimate the formants of the
vocal tract as well. The formants are estimated from the poles
of S(z) in the speech production model. The extent to which the
formant values thus obtained reflect the actual resonances of the
vocal tract depends on several factors. We discuss the adequacy
of the all-pole model for formant extraction (estimation), the
effect of the number of poles p in S(z), the dependency on the
specific method of linear prediction used, and the importance of
the signal frame width and frame positioning. The last factor
is discussed in terms of pitch-synchronous and pitch-asynchronous
analysis. A discussion of windowing is included in
pitch-asynchronous analysis.

Finally, we discuss peak picking of the linear prediction
spectrum as a means of formant extraction. Preemphasis of the
speech signal and computing the spectrum along a contour inside
the unit circle are suggested as two efficient and effective methods
to improve the performance of peak picking in formant
extraction.

6.1  Pitch  Extraction 

If we assume that the model in Fig. 2-1 is accurate for the
production of voiced speech, then by passing the speech signal
s(nT) into a filter that is the inverse of S(z), we should obtain
a signal that is close to u(nT), which consists of a sequence
of impulses. Except for the gain factor A, the filter
H(z) defined in (2-3) is the inverse filter to S(z). From Fig. 4-1
we see that passing the signal s(nT) through the filter H(z) produces
the error signal e(nT). Therefore, e(nT) should be related
to u(nT) by a multiplicative constant for any one frame of speech,
i.e. $e(nT) \approx A\,u(nT)$. The error signal, then, should exhibit
impulses corresponding to the pitch pulses. The separation of
these pulses in time would then be the pitch period, whose inverse
is the instantaneous fundamental frequency $F_0$.

After the predictor coefficients $a_k$ are computed by any desired
method, the error signal e(nT) is obtained from the original
signal by using (3-2), which is repeated here:

$$ e_n = s_n + \sum_{k=1}^{p} a_k \, s_{n-k} . \tag{6-1} $$


$e_n$ is simply the difference between the original signal and the
predicted signal. It is a measure of the inaccuracy in assuming
a linear prediction model. In the direct Autocorrelation method
the original signal is usually windowed before the coefficients
are computed. In that case (6-1) could be applied either to the
windowed signal or to the original signal. In the direct Autocorrelation
method windowing is necessary in order to obtain better
estimates of the coefficients $a_k$. However, once this is done, the
computed coefficients $a_k$ are supposed to apply to the original
signal as well.

Although the coefficients $a_k$ are computed from a specific
frame of the signal, one could compute (6-1) for a time interval
that is larger than that used for computing the coefficients $a_k$.
In a quasi-steady-state situation, the same coefficients $a_k$
should continue to apply to a portion larger than the frame used
for the analysis.
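
The inverse filtering step can be sketched as follows. The sign convention, in which the predicted sample is $-\sum_k a_k s_{n-k}$, is an assumption about (6-1), and the one-pole test signal driven by an impulse train is synthetic; with the opposite convention the coefficients simply change sign.

```python
def error_signal(s, a):
    """Inverse-filter s with predictor coefficients a = [a_1, ..., a_p].

    Assumed convention: e_n = s_n + sum_{k=1..p} a_k * s_{n-k}.
    The first p outputs are left at zero, since e_p is the first value
    that can be computed from the frame alone.
    """
    p = len(a)
    e = [0.0] * len(s)
    for n in range(p, len(s)):
        e[n] = s[n] + sum(a[k - 1] * s[n - k] for k in range(1, p + 1))
    return e

# Synthetic "voiced" signal: one-pole filter excited by impulses every 10 samples.
s = []
previous = 0.0
for n in range(30):
    u = 1.0 if n % 10 == 0 else 0.0
    previous = 0.9 * previous + u
    s.append(previous)

e = error_signal(s, [-0.9])       # the matching predictor under the assumed convention
print([round(x, 6) for x in e[8:13]])
```

With the matching coefficient the error signal is zero everywhere except at the excitation impulses, which is the behavior the pitch extraction argument relies on.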


Figure 6-1 shows examples of error signal analysis using
the direct Autocorrelation method for four types of voiced speech
segments. Each example shows three signals, each 25.6 msec long.
s(nT) is the original signal. The predictor coefficients $a_k$ are
computed from a Hamming-windowed s(nT), then the error signal e(nT)
in the figure is obtained by applying (6-1) to the original unwindowed
signal s(nT). $R_e(nT)$ is the autocorrelation of e(nT).

In Fig. 6-1, e(nT) is normalized with respect to the maximum
error in the frame. Also, the first p values of e(nT) have been
set to zero since e(pT) is the first value we compute. $R_e(nT)$ is
normalized with respect to the maximum value in the frame other
than $R_e(0)$, which is known to be greater than all other autocorrelation
coefficients. In fact, $R_e(0)$ is not shown in the examples
in Fig. 6-1.

In comparison with Fig. 6-1, Fig. 6-2 shows the error autocorrelation
functions for the same four frames, except that the
error signal is obtained from the windowed signal. The computations
were performed in the frequency domain as follows:

$$ R_e(nT) = \frac{T}{2\pi} \int_{-\pi/T}^{\pi/T} |E(\omega)|^2 \, e^{jn\omega T} \, d\omega \tag{6-2} $$

Fig. 6-1. Analysis of error signal for pitch extraction.
(a) The vowel [a] in "potassium".
(b) The liquid [r] in "rubidium".

Fig. 6-1. (Cont'd) Error signal analysis for pitch extraction.
(c) The [a]-[s] transition in "potassium".
(d) The voicing in the voiced stop [b] in "rubidium".

Fig. 6-2. Error autocorrelation functions $R_e(nT)$ for the
same four frames shown in Fig. 6-1, except that
the error signal here is obtained from the
windowed signal.




Report  No.  2304 


Bolt  Beranek  and  Newman  Inc. 


and

$$ |E(\omega)|^2 = |S(\omega)|^2 \, |H(\omega)|^2 = A \, \frac{P(\omega)}{\hat{P}(\omega)} , \tag{6-3} $$

where $P(\omega)$ is the power spectrum of the windowed signal, $\hat{P}(\omega)$ is
the power spectrum corresponding to $S(\omega)$, and A is the minimum
error in (3-37). $P(\omega)$ and $\hat{P}(\omega)$ are computed via the FFT, as described
in Appendix C. Then, (6-2) is computed by an inverse FFT.
(Note that if the speech signal is N samples long, then one should
append at least an equal number of zeros and compute 2N-point FFT's,
in order to obtain the complete autocorrelation function.)
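
The zero-padding remark can be illustrated with a small example. To stay self-contained, the sketch below uses a direct O(N²) discrete Fourier transform in place of an FFT; the helper names are hypothetical and the four-sample signal is arbitrary.

```python
import cmath

def dft(x, inverse=False):
    """Direct discrete Fourier transform (stand-in for an FFT routine)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[m] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for m in range(n))
        out.append(acc / n if inverse else acc)
    return out

def autocorr_via_dft(s):
    """Autocorrelation from the power spectrum of the signal padded with N zeros."""
    n = len(s)
    padded = list(s) + [0.0] * n                 # 2N points avoid circular wrap-around
    power = [abs(X) ** 2 for X in dft(padded)]
    r = dft(power, inverse=True)
    return [r[k].real for k in range(n)]

def autocorr_direct(s):
    """Time-domain autocorrelation, lags 0..N-1."""
    n = len(s)
    return [sum(s[i] * s[i + k] for i in range(n - k)) for k in range(n)]

s = [1.0, 2.0, 3.0, 4.0]
from_spectrum = autocorr_via_dft(s)
from_time = autocorr_direct(s)
print([round(v, 6) for v in from_spectrum], from_time)
```

Without the appended zeros the inverse transform would return the circular autocorrelation, in which products that wrap around the end of the frame contaminate the nonzero lags.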

We  mentioned  earlier  that  the  error  signal  e(nT)  for  a  voiced 
sound  should  exhibit  impulses  that  correspond  to  the  pitch  pulses. 
The  error  signal  in  Fig.  6-la  shows  a  typical  case  where  the  pro¬ 
minent  peaks  can  be  associated  with  pitch  pulses.  The  correspond¬ 
ing  error  autocorrelation  function  shov/s  a  sharp  peak  at  a  lag 
equal to the pitch period. Although Fig. 6-1a is quite typical
for many voiced sounds, there exist a number of important exceptions.
Fig. 6-1b shows an error signal with more than one peak within a
single pitch period. (The prominent peak is associated with
excitation due to closing of the glottis, while the secondary peak
in the middle of the pitch period can be associated with excitation
due to the opening of the glottis.) The error autocorrelation in
Fig. 6-1b still shows a prominent peak at the pitch period.

An important case is shown in Fig. 6-1c during a vowel-consonant
transition. As the voicing decays, the pitch pulses seem to
disappear. The same is true during consonant-vowel transitions.
During both types of transitions the sound is clearly voiced, yet
the error signal does not show any prominent peaks that could be
associated with pitch pulses. Fig. 6-1d shows the same phenomenon
during the voicing in a voiced stop. (Note that e(nT) in each
example has been normalized to the maximum error in that frame. That
is why e(nT) in Fig. 6-1d seems to be excessively large compared
to the other examples; in reality it is not.) The above-mentioned
cases have in common the fact that the signal is not rich in
harmonics, as is normally the case during sustained vowels. Another
way of stating this is that the signal tends to become sinusoidal
in nature in those cases. This is very evident for s(nT) in
Fig. 6-1d. Now, the linear prediction model works very well for
sinusoidal signals. In fact, a pure sine wave can be generated
digitally with each sample being equal to a linearly weighted
summation of the preceding two samples, and this can go on
indefinitely in time. Therefore, for a sine wave, the linear
prediction error signal would be zero for all time (except for the
very first sample), and there would exist no pulses to delineate
pitch periods. The implication for cases such as in Figs. 6-1c and
6-1d is that the error signal ceases to be a good source for
measuring pitch. All is not lost, however, because pitch can now be
estimated from the signal s(nT) itself, since it is quasi-sinusoidal.
This can be done in any number of ways, including peak picking of
the signal itself or its autocorrelation. (Note in Fig. 6-1d that
although e(nT) is very erratic, the autocorrelation R_e(nT) still
exhibits a peak at the pitch period.) It is clear from Fig. 6-2
that the autocorrelation of the error signal obtained from the
windowed speech signal can also be used for pitch extraction.
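
The remark about sine waves can be checked numerically. The following is a minimal sketch, not taken from the report: it assumes the well-known two-tap recursion for a sinusoid, s[n] = 2 cos(w) s[n-1] - s[n-2], and treats the samples before n = 0 as zero. The function name and the chosen frequency and sampling rate are illustrative only.

```python
import math

def two_tap_sine_error(freq_hz, fs_hz, n_samples):
    """Prediction error of s[n] = a1*s[n-1] + a2*s[n-2] applied to a sine wave."""
    w = 2.0 * math.pi * freq_hz / fs_hz          # radians per sample
    s = [math.sin(w * n) for n in range(n_samples)]
    a1, a2 = 2.0 * math.cos(w), -1.0             # exact recursion for a sinusoid
    err = []
    for n in range(n_samples):
        prev1 = s[n - 1] if n >= 1 else 0.0      # samples before n = 0 taken as zero
        prev2 = s[n - 2] if n >= 2 else 0.0
        err.append(s[n] - (a1 * prev1 + a2 * prev2))
    return err

err = two_tap_sine_error(freq_hz=120.0, fs_hz=10000.0, n_samples=400)
largest_after_startup = max(abs(e) for e in err[2:])
print(largest_after_startup)   # rounding-level value: no pulses mark the periods
```

Once the recursion has started, the error stays at rounding level, so nothing in the error signal delineates the period of the sine wave.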

In  summary,  pitch  can  be  extracted  in  most  cases  from  either 
the  error  signal  or  its  autocorrelation.  In  cases  where  the  speech 
signal  is  not  rich  in  harmonics,  pitch  can  be  extracted  directly 
from  the  speech  signal  or  its  autocorrelation.  The  combination 
of  methods  to  use  depends  on  the  properties  of  the  signal  as  well 
as  on  the  specific  application. 
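
As an illustration of extracting pitch directly from a quasi-sinusoidal signal, the sketch below picks the largest autocorrelation peak within an allowed lag range. It is only an assumed, simplified procedure: the synthetic frame, the lag limits, and the function names are hypothetical, and a practical system would need voicing decisions and more careful peak selection.

```python
import math

def autocorrelation(x, max_lag):
    """Short-time autocorrelation R(lag) for lag = 0..max_lag."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def pitch_period_by_peak_picking(x, min_lag, max_lag):
    """Lag (in samples) of the largest autocorrelation value in [min_lag, max_lag]."""
    r = autocorrelation(x, max_lag)
    return max(range(min_lag, max_lag + 1), key=lambda lag: r[lag])

# Synthetic quasi-sinusoidal frame: fundamental plus a weak second harmonic.
fs_hz = 10000.0
period = 80                                   # samples, i.e. 125 Hz at 10 kHz
frame = [math.sin(2.0 * math.pi * n / period)
         + 0.3 * math.sin(4.0 * math.pi * n / period) for n in range(400)]

lag = pitch_period_by_peak_picking(frame, min_lag=25, max_lag=200)
f0_hz = fs_hz / lag
print(lag, f0_hz)
```

The same function can be applied to the error signal e(nT) instead of s(nT) when the signal is rich in harmonics.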

The examples shown in Fig. 6-1 were obtained using the
Autocorrelation method. The same sounds when analyzed using the
Covariance method did not show any significant deviation in the
error signal or its autocorrelation. This was also true for all
the sounds we have examined thus far.

6.2  Formant  Analysis 

In an analysis-synthesis system using linear prediction, where
the synthesizer is of the form shown in Fig. 2-1b, it is necessary
to know the values of the predictor coefficients a_k, but it is
not necessary to know the poles of the filter S(z), which is shown
in Fig. 2-1a and given below (except perhaps to check for possible
instability of the filter):

    S(z) = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}}                (6-4)


However, for applications such as speech recognition and formant
synthesis, it is necessary to compute the poles of S(z) in order
to be able to deduce the formants of the vocal tract. The poles
of S(z) can be computed by setting the denominator of S(z) in
(6-4) to zero and solving the resultant polynomial equation in
z for its roots. (We have successfully used a variation on the
POLRT routine in the IBM Scientific Subroutine Package, 1968. The
variations included elimination of all double precision
computations, raising error tolerances, and modifying the starting
point for each root to be a random point on the unit circle.) Since
the coefficients a_k are real, some or none of the roots are real
and the rest are complex conjugate pairs. Conversion to the s-plane
can be achieved by setting each root z_k = e^{s_k T}, where
s_k = \sigma_k + j\omega_k is the corresponding pole in the s-plane.
If the root z_k = z_{kr} + j z_{ki}, then:

    \omega_k = f_s \arctan \frac{z_{ki}}{z_{kr}}                   (6-5)

    \sigma_k = \frac{f_s}{2} \log (z_{kr}^2 + z_{ki}^2)            (6-6)

where f_s is the sampling frequency.

In  the  s-plane  the  poles  will  also  be  either  real  or  in  complex 
conjugate  pairs. 
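
The conversion in (6-5) and (6-6) can be sketched as follows for a single second-order section. This is a hedged illustration, not the POLRT-based procedure mentioned above: the closed-form quadratic solution, the use of atan2 to resolve the quadrant of the arctangent, and the relation bandwidth = -sigma/pi (in Hz) are assumptions added here, as are all function names and test values.

```python
import cmath
import math

def second_order_poles(a1, a2):
    """Roots of z^2 - a1*z - a2 = 0, the poles of 1 / (1 - a1 z^-1 - a2 z^-2)."""
    d = cmath.sqrt(a1 * a1 + 4.0 * a2)
    return (a1 + d) / 2.0, (a1 - d) / 2.0

def to_s_plane(z, fs_hz):
    """Map a z-plane root to (sigma, omega) following (6-5) and (6-6)."""
    omega = fs_hz * math.atan2(z.imag, z.real)                  # rad/s
    sigma = 0.5 * fs_hz * math.log(z.real ** 2 + z.imag ** 2)   # 1/s
    return sigma, omega

# Build predictor coefficients for a resonance at 500 Hz with 60 Hz bandwidth,
# then recover the resonance from the roots of the denominator.
fs_hz = 10000.0
r = math.exp(-math.pi * 60.0 / fs_hz)
theta = 2.0 * math.pi * 500.0 / fs_hz
a1, a2 = 2.0 * r * math.cos(theta), -r * r

z_upper, z_lower = second_order_poles(a1, a2)
sigma, omega = to_s_plane(z_upper, fs_hz)
freq_hz = omega / (2.0 * math.pi)     # assumed convention: frequency in Hz
bandwidth_hz = -sigma / math.pi       # assumed convention: bandwidth in Hz
print(freq_hz, bandwidth_hz)
```

For p greater than 2 a general polynomial root finder is needed; only the mapping of each root to the s-plane carries over unchanged.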

If  the  speech  spectrum  can  be  approximated  by  poles  only, 
then  the  formants  can  be  obtained  from  the  poles  of  S(z)  by  noting 
that: 

a)  A  formant  consists  of  a  pair  of  complex  conjugate  poles. 

b)  A formant normally has a high ratio between its frequency
and bandwidth. Complex conjugate poles with very wide
bandwidths can be regarded as contributing to general
spectral shaping only.

c)  The frequency range of a particular formant is usually
known.

d)  Peak  picking  can  be  performed  on  the  approximate  spectrum 
as  a  double  check  on  the  formant  values. 

e)  Continuity  of  formant  values  from  one  spectral  frame  to 
another  can  always  be  invoked,  keeping  in  mind  that  very 
fast  formant  transitions  do  exist  in  speech. 
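
Criteria (a) and (b) above lend themselves to a simple screening step. The sketch below is one possible reading, with a hypothetical threshold: it keeps one member of each complex conjugate pair and discards poles whose frequency-to-bandwidth ratio is below `min_freq_to_bw_ratio`. The threshold value, the function names, and the bandwidth convention (-sigma/pi in Hz) are assumptions made for illustration; criteria (c) to (e) are not implemented.

```python
import cmath
import math

def formant_candidates(poles, fs_hz, min_freq_to_bw_ratio=3.0):
    """Return sorted (frequency_hz, bandwidth_hz) pairs for plausible formants."""
    candidates = []
    for z in poles:
        if z.imag <= 0.0:          # skip real poles and the lower conjugate
            continue
        freq_hz = fs_hz * math.atan2(z.imag, z.real) / (2.0 * math.pi)
        bw_hz = -fs_hz * math.log(abs(z)) / math.pi
        if bw_hz > 0.0 and freq_hz / bw_hz >= min_freq_to_bw_ratio:
            candidates.append((freq_hz, bw_hz))
    return sorted(candidates)

def pole(freq_hz, bw_hz, fs_hz):
    """Upper-half-plane z-plane pole for a given frequency and bandwidth."""
    return cmath.rect(math.exp(-math.pi * bw_hz / fs_hz),
                      2.0 * math.pi * freq_hz / fs_hz)

fs_hz = 10000.0
upper = [pole(500.0, 60.0, fs_hz), pole(1500.0, 90.0, fs_hz),
         pole(2500.0, 1200.0, fs_hz)]          # the last one is very wide
poles = upper + [z.conjugate() for z in upper] + [complex(0.9, 0.0)]

found = formant_candidates(poles, fs_hz)
print(found)
```

In this example the wide-bandwidth pair and the real pole are treated as general spectral shaping, leaving two formant candidates.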

The  extent  to  which  the  formant  values  thus  obtained  reflect  the 
actual  resonances  of  the  vocal  tract  depends  on  at  least  the 
following  factors: 

a)  Adequacy  of  the  all-pole  model. 

b)  Number of poles p.

c)  Method of analysis (e.g. Autocorrelation or Covariance
method).

d)  Frame width: number of samples in one frame of the signal;
and frame positioning (e.g. whether pitch synchronous or
asynchronous, etc.).

Ideally, these factors would be taken into consideration separately
for each frame of interest. However, this can be very expensive
computationally, so in practice, tradeoffs are made between cost
and reliability of the desired results. We shall discuss briefly
the above-mentioned factors and point out some of these tradeoffs.

We wish to emphasize here that the discussion below applies
to the Covariance as well as the Autocorrelation method, unless
specifically stated otherwise.

6.21  Adequacy  of  the  All-Pole  Model 

This issue has already been discussed in Section 2.3. We
have argued there that the all-pole model seems to be quite
adequate for speech synthesis. The question here is the adequacy of
the model for formant extraction. For the purposes of speech
recognition, for example, one would ideally want to be able to
compute the transfer function of the vocal tract. This means that
the antiformants as well as the formants may be needed. It is
reasonable to assume that the all-pole model would be adequate for
formant extraction of vowels. (This assumption is based on another
assumption, namely that the glottal spectrum and radiation can be
approximated by poles only.) However, for sounds such as nasals
and fricatives, whose spectra are known to have antiformants, the
all-pole model might not yield accurate results for the resonances
of the vocal tract. Figure 6-3 shows the signal spectrum and the
linear prediction spectrum (p=14) for the second [n] in the word
"anyone" for a male speaker. The problem in looking at a spectrum
like this is in deciding where the formants and antiformants are.
There is no good way of making this decision in general, unless
one has some knowledge about the system that produced the signal
whose spectrum is under analysis. In fact, the spectral fit in
Fig. 6-3 is very adequate, and it is quite reasonable to assume
that some all-pole system has those characteristics. However,
from our knowledge of the acoustics of the human speech production
system, we know that if the spectrum in Fig. 6-3 is that of the
sound [n], it must have zeros as well as poles. But even if we
knew this, how would the linear prediction all-pole approximation
help us in determining the values of the formants and antiformants?
Some of the poles will correspond approximately to nasal formants,
which can be obtained as described earlier in this section, but we
know of no simple manner in which the antiformants can be determined
from the poles of the linear prediction spectrum. The problem is
that the same poles must approximate the effects of both the
formants and the antiformants. This is clear from the fact that the
linear prediction spectral matching process performs equally well
at all frequencies irrespective of the shape of the speech
spectral envelope (see Section 4.3). Another consequence is that the
positions and, more so, the bandwidths of the extracted formants
will often be very different from their "actual" values, depending
on the position of each formant with respect to the antiformants.
Formants that are far from the nearest antiformant are well
approximated, while those that are close to an antiformant are often
poorly approximated. A formant that is close to an antiformant
can appear as a very wide-bandwidth peak which might go undetected.
With nasals, the first formant is normally well approximated since
it is separated from the nearest antiformant by at least one other
formant. Other extracted formants may or may not be reliable,
depending on the speaker and the particular sound (i.e. in general
unreliable). For example, in Fig. 6-3 the first and second
formants seem to be adequately approximated. The third peak at
2.6 kHz is probably the fourth nasal formant. Between the second and
fourth formants there should be a formant cluster, i.e. a cluster
of two formants and one antiformant (see Section 2.4). The
antiformant may be around 1.8 kHz, but it is not clear where the two
formants are exactly.

Analysis of fricatives runs into the same problems as nasals,
if one is interested in determining the zeros as well as poles.
At least the first two formants are heavily damped for all
fricatives, due to neighboring antiformants. Pronounced formant peaks
at mid to high frequencies (2.5 - 6 kHz) occur for [s] and [š]
only (Heinz and Stevens, 1961); these formants are usually
attainable by linear prediction. Also, certain stop bursts, especially
that of [k], are well represented. However, there is always the
problem of pairing the formant peaks with the formant numbers,
i.e. whether a particular peak corresponds to the third formant
or the fourth formant, etc. This problem can be particularly
important in speech recognition.

We have assumed in much of the above that one is interested
in extracting most of the formants and antiformants for a
particular sound. However, for speech recognition, all of this might
not be necessary. For example, given a relatively weak voiced
sound with a formant structure, such that the first formant is
very low, and the spectral transitions to and from this sound are
abrupt, one can safely recognize that as a nasal much of the time.
Formant transitions to or from this nasal could then be used to
determine the place of articulation of the nasal. All this can
be done without knowing whether there are zeros or not in the
spectrum under analysis. Similar considerations exist for the
recognition of fricatives. However, a major problem arises with
nasalized vowels. The introduction of zeros into a vowel spectrum
can be disastrous. The reason is that we depend heavily on the
exact positions and the bandwidths of the extracted formants for
the recognition of the vowel, and the introduction of zeros plays
havoc with the real formant frequencies and bandwidths. We know
of no good solution for this problem using the linear prediction
model.

In the above we have seen that the linear prediction model
is inadequate for the extraction of formants and antiformants
from a spectrum containing zeros as well as poles. In these cases
one could use other methods such as analysis-by-synthesis that
includes zeros as well as poles in the approximate spectrum. Of
course, one must first know whether the spectrum is likely to have
zeros or not. This can be done from separate considerations, such
as we have suggested above for the recognition of nasals.
Therefore, one must first perform some form of class recognition on
the sound under analysis. If that sound is recognized to be, say,
a nasal or a fricative, then the alternate analysis-by-synthesis
method can be used. Similarly, if a vowel is next to a nasal, one
can assume that the vowel might be nasalized, then resort to the
other method to determine formant positions more accurately.

6.22  Number  of  Poles


Assuming that the all-pole model is adequate for a particular
speech segment, the confidence and accuracy in relating certain
poles of the linear prediction model to actual resonances of
the vocal tract depend to a good extent on the total number of
poles p. If the value of p is too small, there may not be enough
poles to represent all the resonances in the frequency range of
interest. On the other hand, if p is too large, there will be
extraneous poles which might be mistaken for formants of the
vocal tract. It is clear that between the two extremes, there must
exist some value (or range of values) of p which is optimal for
the accurate extraction of formants. In fact, the value of p
should be set such that the linear prediction transfer function
S(z) approximates the transfer function of the vocal tract
(including the effects of the glottal flow and radiation). We have
seen from the last two chapters that this approximation occurs in
the power spectral domain. Namely, the linear prediction spectrum
\hat{P}(\omega) (or 2D-spectrum \hat{Q}(\omega,\omega')) approximates
the signal spectrum P(\omega) (or 2D-spectrum Q(\omega,\omega')).
In particular, we want the linear prediction spectrum to
approximate the envelope of the signal spectrum. (Hereafter, the word
"spectrum" will refer to both the one-dimensional stationary
spectrum used in the Autocorrelation method, and the two-dimensional
nonstationary spectrum used in the Covariance method.) What we are
claiming is the following:

    A value of p that results in an optimal spectral
    envelope fit also results in an optimal number
    of poles, many of which can be related, with good        (6-7)
    confidence and accuracy, to the resonances of the
    vocal tract.

That is, the optimal value of p gives the best confidence and
accuracy relative to that obtained by other values of p. The remaining
question is how to find this optimal value for p.

The reader is referred to Section 5.6 where the optimal p
is deduced from the normalized error curve. There, the discussion
was restricted to the Autocorrelation method. Here we shall
extend the results of Section 5.6 to the Covariance method as well.
We shall define the normalized error V in the Covariance method
as equal to

    V = \frac{E_p}{\phi_{00}} = 1 - \sum_{k=1}^{p} a_k \frac{\phi_{0k}}{\phi_{00}}    (Covariance Method)    (6-8)

where E_p is the minimum total-squared error in (3-19), and \phi_{00}
is the energy in N samples of the signal. We have found that the
behavior of V in the Covariance method is very similar to that
in the Autocorrelation method. In both methods the error curve
exhibits a "knee" after which the curve slopes down at a slow
rate. The optimal value of p is that point where the error curve
begins to fall off slowly. This method has been corroborated by
informal observations. However, we have seen that the bandwidths
of the resulting formants were less accurate and more variable
than the formant frequencies.
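
The "knee" in the normalized error curve can be illustrated on a synthetic signal. The sketch below is an assumption-laden example for the Autocorrelation method only: it uses the Levinson-Durbin recursion (not described in this section) to obtain V_p = E_p / R_0 for successive p, and a made-up four-pole impulse response as the signal. All names and parameter values are hypothetical.

```python
import math

def resonator(x, freq_hz, bw_hz, fs_hz):
    """Two-pole filter y[n] = x[n] + a1*y[n-1] + a2*y[n-2]."""
    r = math.exp(-math.pi * bw_hz / fs_hz)
    a1, a2 = 2.0 * r * math.cos(2.0 * math.pi * freq_hz / fs_hz), -r * r
    y = []
    for n, xn in enumerate(x):
        y1 = y[n - 1] if n >= 1 else 0.0
        y2 = y[n - 2] if n >= 2 else 0.0
        y.append(xn + a1 * y1 + a2 * y2)
    return y

def autocorrelation(x, max_lag):
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def normalized_error_curve(r, max_p):
    """V_p = E_p / R_0 for p = 0..max_p via the Levinson-Durbin recursion."""
    a = []                 # a[k] holds a_{k+1} in s[n] ~ sum_k a_k s[n-k]
    e = r[0]
    v = [1.0]
    for p in range(1, max_p + 1):
        acc = r[p] - sum(a[k] * r[p - 1 - k] for k in range(p - 1))
        k_p = acc / e
        a = [a[k] - k_p * a[p - 2 - k] for k in range(p - 1)] + [k_p]
        e *= 1.0 - k_p * k_p
        v.append(e / r[0])
    return v

# Four-pole synthetic "speech": impulse through two resonators in cascade.
fs_hz = 10000.0
impulse = [1.0] + [0.0] * 599
signal = resonator(resonator(impulse, 500.0, 100.0, fs_hz), 1500.0, 100.0, fs_hz)

r = autocorrelation(signal, 8)
v = normalized_error_curve(r, 8)
for p, vp in enumerate(v):
    print(p, vp)
```

For this idealized signal the curve drops sharply up to p = 4 and is flat afterwards, so the knee coincides with the true number of poles; real speech gives a less clear-cut curve.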

Statement (6-7) and the above procedure for finding the
optimal value for p are correct only if the all-pole model is
adequate. For purposes of speech synthesis this is generally the
case. However, as we have seen above, if relatively accurate
formant (or antiformant) information is needed, then the all-pole
model is not adequate for sounds with antiformants, such as
nasals and fricatives. In these cases it is not clear how one
would choose an optimal value for p, if such a value exists. We
shall illustrate this problem by an example. Figure 6-4 shows
the normalized error curve for the nasal [n] in the word "nickel"
by the same speaker associated with Fig. 6-3. (The analysis was
done using the direct Autocorrelation method, but the discussion
here also applies to the Covariance method.) The point after which
the error curve slopes down slowly is around p=12. For this value
of p we show the approximate and signal spectra in Fig. 6-5. Only
the first and fourth formants appear in the approximate spectrum.
In the signal spectrum one can clearly see in addition two other
formants between the first and fourth. In order for these two
other formants to appear in the approximate spectrum we must
increase the value of p. From Fig. 6-4 we see that at p=18 there
is a noticeable decrease in the error curve from the value at p=12.
We interpret such a change in the error curve as reflecting a
correspondingly noticeable change in the approximate spectrum.
This change is evident in Fig. 6-6, where the two formants between
the first and fourth now appear in the approximate spectrum.
Unfortunately, this caused side effects around the first formant
and at high frequencies. The position of the first peak moved
closer to that of the first harmonic and another wide-bandwidth
pole was introduced next to it; it is no longer clear where the
first formant really is. At frequencies higher than 3 kHz it
looks as if we have three extra peaks, which most probably do not
correspond to actual resonances of the nasal tract, since that
region of the spectrum is at the noise level. In summary, in
order to have the linear prediction spectrum show the formants
evident on the signal spectrum, there are two problems: (a) one must
somehow determine the necessary value of p, and (b) even if that
value of p is known, the results of the formant extraction may or
may not correspond to resonances of the speech production mechanism,
depending on the particular sound.

Fig. 6-4.  Normalized error curve for [n] in the word "nickel".

Fig. 6-5.  Approximate spectrum (p=12) and signal spectrum
for [n] in the word "nickel". (Vertical axis: relative energy in dB.)

Fig. 6-6.  Approximate spectrum (p=18) and signal spectrum
for [n] in the word "nickel".

Fig. 6-7.  Linear prediction spectrum (p=18) using the
Covariance method for [n] in the word "nickel".

6.23  Method  of  Analysis

From a purely theoretical point of view, the assumptions
underlying the Covariance method are superior to those underlying
the Autocorrelation method. The Covariance method assumes that
the signal in the frame of interest is nonstationary, while the
Autocorrelation method assumes that the signal is stationary.
Speech is a nonstationary process and therefore the assumption
of nonstationarity is superior to that of stationarity. However,
in any single frame of interest, the signal can be considered to
be quasi-stationary. In that case, the assumption of stationarity
is not a bad one, but the assumption of nonstationarity is still
a better one.

It can be shown that if a signal is generated from an all-pole
source, the Covariance method can recover these poles exactly
by using only a finite number of samples of the signal (Portnoff,
Zue and Oppenheim, 1972). The same is not true for the
Autocorrelation method unless the infinite signal is considered.
However, very good approximations to the poles can be obtained from
only a finite portion of the signal. Our experience with real speech
has been that if the period of analysis is on the order of a pitch
period or greater, the poles resulting from both methods are very
close to each other. For example, Fig. 6-7 shows the linear
prediction spectrum (using the Covariance method) for the same
conditions as those of Fig. 6-6.
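
The exact-recovery property can be demonstrated on a deliberately short frame. The sketch below is an illustration under stated assumptions, not the analysis program used for the figures: the signal is a synthetic two-pole free response with no excitation inside the frame, the normal equations are solved by plain Gaussian elimination, and the Autocorrelation method is applied to the same 12 samples with a rectangular window. All names and values are hypothetical.

```python
import math

def solve(matrix, rhs):
    """Solve a small dense linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(aug[i][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for j in range(col, n + 1):
                aug[row][j] -= factor * aug[col][j]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][j] * x[j] for j in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x

def covariance_method(s, p):
    """Predictor from phi(i,k) = sum over n = p..N-1 of s[n-i]*s[n-k]."""
    def phi(i, k):
        return sum(s[n - i] * s[n - k] for n in range(p, len(s)))
    matrix = [[phi(i, k) for k in range(1, p + 1)] for i in range(1, p + 1)]
    return solve(matrix, [phi(i, 0) for i in range(1, p + 1)])

def autocorrelation_method(s, p):
    """Predictor from R(|i-k|) of the frame, taken as zero outside the frame."""
    def r(lag):
        return sum(s[n] * s[n + lag] for n in range(len(s) - lag))
    matrix = [[r(abs(i - k)) for k in range(1, p + 1)] for i in range(1, p + 1)]
    return solve(matrix, [r(i) for i in range(1, p + 1)])

# Two-pole free response; the frame is 12 samples with no excitation inside it.
fs_hz = 10000.0
radius = 0.98
theta = 2.0 * math.pi * 500.0 / fs_hz
true_a = [2.0 * radius * math.cos(theta), -radius * radius]
h = [1.0, true_a[0]]
for n in range(2, 40):
    h.append(true_a[0] * h[n - 1] + true_a[1] * h[n - 2])
frame = h[5:17]

cov_a = covariance_method(frame, 2)
acf_a = autocorrelation_method(frame, 2)
cov_err = max(abs(c - t) for c, t in zip(cov_a, true_a))
acf_err = max(abs(c - t) for c, t in zip(acf_a, true_a))
print(cov_err, acf_err)
```

With a frame this short (less than one period of the resonance), the Covariance estimate matches the true coefficients to rounding error, while the Autocorrelation estimate is visibly biased; the gap narrows as the frame is lengthened.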

Another point of comparison is in how the two methods compare
in an analysis-synthesis system. Thus far we have not made such
a comparison. However, Atal (personal communication) claims that
the Covariance method produces higher quality speech in an
analysis-synthesis system.

6.24  Frame  Width  and  Position 

In the speech production model in Section 2.1, we defined a
frame as an interval of time within which the human vocal tract
can be assumed to be fixed. This interval is usually on the order
of 10-25 msec. A specific choice for a frame width and position
depends on several factors:

(a)  The  type  of  signal  to  be  analyzed. 

(b)  The  application  for  the  analysis. 

(c)  Whether one uses the direct or indirect method of analysis.

We shall be discussing the above three factors interchangeably,
but first we must explain what we mean by the direct and indirect
method of analysis. In Section 4.4 the terms "direct" and
"indirect" were applied to the Autocorrelation method to refer to
whether the autocorrelation coefficients were computed from a windowed
signal, or from an apparent autocorrelation function which was
computed from a finite portion of an unwindowed signal, respectively.
In Section 4.6, the Covariance method was reformulated in an
analogous manner into a direct and an indirect method. Therefore, the
term "direct" implies that the signal has been appropriately
windowed, i.e. the resulting signal is infinite in extent but is zero
outside the frame of interest, while the term "indirect" refers
to the fact that a finite unwindowed frame of the signal is used
in the analysis without making any assumptions about the signal
outside that frame. It so happens that the two popular methods
defined in Chapter I are the direct Autocorrelation and indirect
Covariance methods. However, we wish to emphasize here that the
issue of direct versus indirect analysis is independent of the issue
of Autocorrelation versus the Covariance method, which we have
already discussed. One important issue that faces the direct method
is a proper choice of the window to be used in each case.

There are instances during the analysis of a speech utterance
when the frame position and width are critical factors and must be
chosen judiciously. For example, in analyzing a stop burst, it
is best to have the frame positioned to include the burst and
nothing more. During rapid transitions (such as certain
vowel-nasal transitions), the frame width should be small enough so
that the sharp transition can be detected. In general, the frame
width and position should be chosen such that the assumption that the
vocal tract is fixed during that time interval remains valid.

For fricatives, the frame width and position are not critical
factors in the analysis. Thus, any "effective" frame width on
the order of 10-25 msec can be used with generally similar results.
(The effective frame width is discussed in Section 6.242.) On the
other hand, for sonorants, the frame width and position can be
important factors, depending on the particular application for the
analysis. Below, we shall restrict the discussion to the analysis
of sonorants. It is hoped that from the method of presentation
one can extrapolate the results to other situations. We shall
differentiate between two major types of analysis:
pitch-synchronous and pitch-asynchronous.

6.241  Pitch-Synchronous  Analysis 

Pitch-synchronous analysis implies that one is somehow able
to detect pitch, and then delineate each pitch period for analysis.
(For example, one could perform a pitch-asynchronous analysis and
detect pitch pulses, as in Fig. 6-1a, then reanalyze intervals
between adjacent pitch pulses.) Let us assume for the moment that
the frame of analysis is defined to be the whole pitch period.
This case is of special interest because a pitch period represents
(approximately) the impulse response of the combined effects of
the glottal source, the vocal tract and radiation. The word
"approximately" was used because the signal in a pitch period
includes contributions (though small) from past vocal tract
excitations whose effects have not completely decayed as yet. These
contributions increase with increased pitch (i.e. shorter pitch
period, as for females and children), causing the approximation
to be worse. This is a basic loss of information that cannot be
recovered without adding some compensatory information. We shall
resort to the frequency domain to explain what we mean by the last
statement. The impulse response under discussion is theoretically
infinite (though practically it dies within 30 msec), and its
power spectrum is a continuous function of frequency. The power
spectrum of the response due to a periodic train of unit pulses,
at a rate of F_0 pulses per second, contains energy only at
multiples of the fundamental, i.e. at f = nF_0. This discrete spectrum
has an infinity of possible envelopes. Two of these envelopes are
the impulse response spectrum and the spectrum of a single pitch
period. In other words, the pitch period spectrum is guaranteed
to be equal to the impulse response spectrum only at multiples of
F_0. To the extent that the pitch period spectrum is not equal to
the impulse response spectrum for f ≠ nF_0, we say that information
has been lost. It is easy to see that as F_0 increases, the loss
of information is likely to increase. It is in this sense that
female or children's speech (with higher pitch), relatively
speaking, contains less information about the response of the
articulatory mechanism than does male speech (with lower pitch).
This loss of information is irrecoverable unless extra information
is supplied from an independent source. We shall argue that linear
prediction supplies extra information which hopefully recovers part
of the information lost.
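
The statement that the pitch period spectrum equals the impulse response spectrum at multiples of F_0 can be verified numerically. In the sketch below, which is an assumed idealization, one pitch period of the steady-state response is formed by summing the impulse response with copies of itself delayed by whole periods, and its discrete Fourier transform is compared with the all-pole transfer function sampled at the harmonics. The two-pole system and all names are hypothetical.

```python
import cmath
import math

fs_hz = 10000.0
period = 80                                    # samples per pitch period (125 Hz)
radius = math.exp(-math.pi * 80.0 / fs_hz)     # 80 Hz bandwidth
a1 = 2.0 * radius * math.cos(2.0 * math.pi * 700.0 / fs_hz)
a2 = -radius * radius

# Impulse response of 1 / (1 - a1 z^-1 - a2 z^-2), long enough to have decayed.
h = [1.0, a1]
for n in range(2, 2000):
    h.append(a1 * h[n - 1] + a2 * h[n - 2])

# One period of the steady-state response to a unit pulse train: the current
# impulse response plus the tails left over from all earlier pulses.
one_period = [sum(h[n::period]) for n in range(period)]

def dft_bin(x, k):
    """k-th DFT coefficient of x, i.e. its spectrum at the k-th harmonic."""
    return sum(xn * cmath.exp(-2j * math.pi * k * n / len(x))
               for n, xn in enumerate(x))

def impulse_response_spectrum(omega):
    """Transfer function evaluated at omega radians per sample."""
    return 1.0 / (1.0 - a1 * cmath.exp(-1j * omega) - a2 * cmath.exp(-2j * omega))

worst = max(abs(dft_bin(one_period, k)
                - impulse_response_spectrum(2.0 * math.pi * k / period))
            for k in range(period))
print(worst)   # agreement at every harmonic, up to rounding and truncation
```

Between the harmonics the two spectra are free to differ, which is the loss of information discussed above; shortening `period` (raising F_0) leaves fewer points at which they are forced to agree.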

Given the spectrum of a single pitch period, the problem is
to estimate the spectrum of the impulse response. In linear
prediction the extra information takes the form of an assumption about
the nature of the impulse response spectrum, namely that it is
all-pole. To the extent that the all-pole model is correct, we have
succeeded in adding the needed compensatory information. Thus,
recovery of lost information is bound to be more successful with
vowels (which are well modelled by poles) than with nasals (which
are best modelled by a combination of poles and zeros). Supplying
additional information by judiciously assuming a model is the basic
idea and power behind the general method of spectral
analysis-by-synthesis. Linear prediction is a special case of
analysis-by-synthesis where the assumed model is restricted to be
all-pole.

We conclude from the above discussion that if one wishes to
use the direct method of analysis over a pitch period, then the
window to be used should be rectangular and should coincide in
position and width with the pitch period under analysis. In
other words, the samples over a pitch period should be left
intact. Any window other than rectangular will introduce unwanted
distortion in the signal spectrum and consequently in the
linear prediction spectrum approximating the impulse response
spectrum.

Thus far we have assumed that the frame for analysis consists
of the whole pitch period. There are applications for which it
is desirable to perform the analysis on only a portion of the
pitch period. The portion of the signal during which the glottis
is closed is of particular interest. It is well known that the
major excitation of the vocal tract occurs at the closing of the
glottis. Thus, during the first portion of a pitch period the
glottis is closed. The vocal tract is excited again as the glottis
opens, but to a lesser degree. The vocal tract resonances are
different in the closed- and open-glottis conditions. When the
glottis is closed, the subglottal tract is decoupled from the
system and the resonances are those of the vocal tract proper.
When the glottis is open, there is coupling to the subglottal
tract, thus causing changes in the over-all system resonances.
In particular, bandwidths tend to be larger when the glottis is
open. Coupling to the subglottal tract could also introduce extra
zeros and poles in the signal spectrum. By analyzing the whole
pitch period, one is actually averaging out the closed- and
open-glottis characteristics. The result is often reflected in
variability of the formant bandwidths and, to a lesser extent,
the formant frequencies. Therefore, in order to obtain accurate
formant information for the vocal tract, it is best to perform
the analysis on the portion of the signal when the glottis is
closed (see Pinson, 1963). The problem here is to know when
the glottis is closed in relation to the signal. The only thing
we are sure of is that the glottis is closed during the first
portion of the pitch period. This interval can be anywhere
between zero and a few milliseconds, depending on the condition of
phonation. Although we cannot be sure of the glottis condition,
it would still be more accurate, on the average, to analyze the
first portion of the pitch period than to analyze the whole pitch
period.

Analyzing a portion of the pitch period is best done using
the indirect method. The direct method is bound to give gross
errors (see the discussion on windowing below). We note here
that the indirect Covariance method as well as one of the indirect
Autocorrelation methods require a minimum interval of analysis
equal to 2p samples, where p is the number of predictor
coefficients.

6.242 Pitch-Asynchronous Analysis and Windowing

As "pitch-asynchronous" suggests, the frame width and position
are here independent of pitch information. This poses no
serious problems if the indirect method is used, and the results
would vary little from those obtained using pitch-synchronous
analysis, especially if the frame width is on the order of a
pitch period or larger. However, if the direct method is to be
used, the results could vary a great deal depending on the frame
width and the window shape used. We shall now discuss the problem
of windowing in the direct method of analysis. The discussion
will be detailed and rigorous because we feel that the subject of
windowing has not been treated with enough rigor in the past,
when applied to speech analysis.

In discussing pitch-synchronous analysis with the direct
method, we saw that a rectangular window over the whole pitch
period is best, because we are then certain that the signal
spectrum would equal the impulse response spectrum at least at
multiples of the fundamental frequency $F_0$. The best we can hope for
in pitch-asynchronous analysis is that the signal spectrum
approximate, as well as possible, the spectral values of the impulse
response spectrum at $f = nF_0$. This is the purpose of windowing.

We shall again resort to the frequency domain to show how our
objective can be accomplished by proper windowing. For
simplicity, the discussion will be carried on for continuous time
signals, but the results will apply also to discrete or sampled
signals.

Let $x(t)$ be a periodic signal with period $\tau = 1/F_0$ and Fourier
integral transform $X(f)$. Let $s(t)$ be the signal obtained by
multiplying $x(t)$ by a window function $w(t)$:

$$ s(t) = w(t)\, x(t). \tag{6-9} $$

Then the Fourier transform of $s(t)$ is the convolution of the
transforms of $w(t)$ and $x(t)$:

$$ S(f) = W(f) \circledast X(f) = \int_{-\infty}^{\infty} W(f-\lambda)\, X(\lambda)\, d\lambda, \tag{6-10} $$

where $S(f)$ and $W(f)$ are the Fourier transforms of $s(t)$ and $w(t)$,
respectively, and the symbol $\circledast$ represents convolution.

Since $x(t)$ is a periodic signal, its Fourier transform $X(f)$ is a
line spectrum that can be represented by

$$ X(f) = \sum_{n=-\infty}^{\infty} Z(f)\, u_0(f - nF_0), \tag{6-11} $$

where $u_0(f)$ is the unit impulse function defined in (4-40),
and $Z(f)$ is some envelope function whose values are specified
at $f = nF_0$, but can be arbitrary otherwise. (For example, one
can think of $Z(f)$ as the transform of the impulse response of the
vocal tract, or as the transform of a single pitch period. The
two transforms are equal at $f = nF_0$.)

Substituting (6-11) in (6-10), performing the integration, and
replacing n by m, we obtain:

[...]

$$ W(f) = \frac{\sin(\pi M f / F_0)}{\pi M f / F_0}. \tag{6-16} $$

Substituting (6-16) in (6-13), we obtain:

$$ S(nF_0) = \sum_{m=-\infty}^{\infty} \frac{\sin[\pi M(n-m)]}{\pi M(n-m)}\, Z(mF_0). \tag{6-17} $$

Note that the window term in (6-17) is equal to zero for all
values of m except for m=n, when it is equal to 1.
Therefore, (6-17) reduces to

$$ S(nF_0) = Z(nF_0), \tag{6-18} $$

which is identical to (6-14). Therefore, (6-14) is exactly
satisfied for a rectangular window whose width is equal to an
integral number of pitch periods. In particular, it is true for
a single pitch period, a result that we already know.
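
The behavior of the window term in (6-17) can be checked numerically. The sketch below is only an illustration: the function name `window_term` and the choice of M values are assumptions, and `numpy.sinc(x)` is used because it computes sin(pi x)/(pi x). For an integer M the term vanishes at every nonzero harmonic offset; for a non-integer M (a window that is not a whole number of pitch periods) neighboring harmonics leak into each other.

```python
import numpy as np

def window_term(M, k):
    """Window term of (6-17), sin(pi*M*k)/(pi*M*k), with k = n - m."""
    # numpy.sinc(x) is defined as sin(pi*x)/(pi*x), with sinc(0) = 1.
    return np.sinc(M * np.asarray(k, dtype=float))

k = np.arange(-5, 6)            # harmonic offsets n - m
whole = window_term(3, k)       # window spans exactly 3 pitch periods
partial = window_term(2.5, k)   # window spans 2.5 pitch periods (illustrative)

print(np.round(whole, 6))       # 1 at k = 0, zero elsewhere
print(np.round(partial, 3))     # nonzero leakage at k != 0
```

With M = 2.5 the term at k = 1 is about 0.13 in magnitude, so the spectral sample at one harmonic is contaminated by its neighbors.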

The above result clearly satisfies (6-14), which is our
objective, but it suffers from one major drawback, namely that the
window depends on the exact length of a pitch period. Thus, it is
really a pitch-dependent window, which is of little use in
pitch-asynchronous analysis. We need a pitch-asynchronous window, one
whose width does not depend on the exact length of the pitch
period, and which satisfies (6-14) as well as possible.

We note again that what allowed us to reduce (6-17) to (6-18)
was the important fact that the window term was equal to zero for
all values of m except for m=n, when it was equal to 1. In other
words, we have $W(0)=1$, and $W(nF_0)=0$ for all $n \neq 0$. If we could
find window functions such that $W(0)=1$ and $|W(nF_0)|<\epsilon$ for all
$n \neq 0$, where $\epsilon \ll 1$, then (6-14) would be approximately satisfied.
Going further, if $|W(f)|<\epsilon$ for all $|f| \ge F_0$, then clearly
$|W(nF_0)|<\epsilon$ is satisfied, and our objective is also achieved. A
value of $W(0)$ different from 1 merely introduces a multiplicative
constant to (6-14), which can be easily corrected for. What is
important in specifying a window is the relative amplitude of
$W(f)$ with respect to $W(0)$. Therefore, our only condition that a
window function must satisfy is:

$$ \frac{|W(f)|}{W(0)} < \epsilon, \quad |f| \ge F_0, \quad \epsilon \ll 1. \tag{6-19} $$

One often picks $\epsilon \le 0.02$ for good results. This is equivalent to
$W(f)$ being at least 34 dB below the peak $W(0)$ for $|f| \ge F_0$. We shall
now give a few examples of window functions that have been
suggested. These functions have the property that they are even
functions of time. Although this property is not required for
our application, it clearly does no harm.

There are two major families of window functions that are
in use today. The first is what we shall call the Cosine family.
These functions are raised cosines or convolutions of raised
cosines. The two most popular Cosine window functions are the
Hanning and Hamming windows (Blackman and Tukey, 1958, pp. 95-99).
These are given by:

Hanning:   $w_H(t)$ = [expression illegible in the source]   (6-20)

Hamming:   $w_h(t)$ = [expression illegible in the source]   (6-21)

where [equation (6-22) illegible in the source] $\tau'$ is the window
size or width, and $u_{-1}(x)$ is the unit-step function defined by:

$$ u_{-1}(x) = \begin{cases} 0, & x < 0, \\ 1, & x > 0. \end{cases} \tag{6-23} $$

Both windows in (6-20) and (6-21) have been normalized such that
$W(0) = 1$.

The other major family of window functions is what we shall
call the SINC family, because their Fourier transforms are of
the form $\left(\frac{\sin \pi x}{\pi x}\right)^n$, and the function $\frac{\sin \pi x}{\pi x}$ is often
referred to as sinc x. This family is generated in the time domain
by (n-1) convolutions of the rectangular window, with the
appropriate normalization to keep the window size equal to $\tau'$.
This family is represented in the frequency domain by:

$$ W_n(f) = \left[\frac{\sin(2\pi f \tau_w / n)}{2\pi f \tau_w / n}\right]^n, \tag{6-24} $$

where n is the order of the window. Thus, $W_1(f)$ is the
rectangular window, $W_2(f)$ is the triangular, etc. It has been shown
that the corresponding time domain window $w_n(t)$ is given by
(Makhoul, 1970a):

$w_n(t)$ = [expression illegible in the source; a sum over $k$ starting at $k = 0$]   (6-25)

where n is any positive integer, $\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$, and
[symbol illegible in the source] is the integer portion of $\frac{n-1}{2}$.

A window that is of particular interest is $w_4(t)$, which is
sometimes called the Parzen window, given by:

$$ w_4(t) = \frac{8}{3\tau_w}\left[\left(1-\frac{|t|}{\tau_w}\right)^3 u_{-1}\!\left(1-\frac{|t|}{\tau_w}\right) - 4\left(\frac{1}{2}-\frac{|t|}{\tau_w}\right)^3 u_{-1}\!\left(\frac{1}{2}-\frac{|t|}{\tau_w}\right)\right]. \tag{6-26} $$
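
The statement that the SINC family is generated by repeated convolution of the rectangular window can be illustrated with a discrete approximation. This is a sketch under stated assumptions: the function names are hypothetical, the windows are sampled rather than continuous, and the closed form used for comparison is the familiar piecewise-cubic Parzen shape scaled to a peak of 1, with x standing for |t| divided by the half-width of the window.

```python
import numpy as np

def sinc_family_window(order, box_len):
    """Discrete window of the given order: (order - 1) convolutions of a box."""
    box = np.ones(box_len)
    w = box.copy()
    for _ in range(order - 1):
        w = np.convolve(w, box)
    return w / w.sum()           # unit area, so the spectrum equals 1 at f = 0

def parzen_closed_form(x):
    """Piecewise-cubic Parzen shape with peak 1; x = |t| / half-width."""
    x = np.abs(x)
    outer = np.where(x <= 1.0, (1.0 - x) ** 3, 0.0)
    inner = np.where(x <= 0.5, (0.5 - x) ** 3, 0.0)
    return 2.0 * (outer - 4.0 * inner)

triangle = sinc_family_window(2, 64)   # order 2: triangular window
parzen = sinc_family_window(4, 64)     # order 4: approximately the Parzen window
```

The order-4 result agrees with the closed form to within the discretization error, which shrinks as `box_len` grows.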


In order to see how (6-19) might apply, we shall discuss
three windows: the rectangular, Parzen, and Hamming windows.
The three windows are shown in Fig. 6-8, along with a summary
of their spectral characteristics. A plot of the power spectrum
for each window is shown in Fig. 6-9. We shall first discuss
the Hamming window spectrum shown in Fig. 6-9b. We note that
for $f \ge 2f_0$, $|W_h(f)|$ is at least 40 dB below $W_h(0)$, and hence (6-19)
will apply with $\epsilon = .01$ if the following condition holds:

$$ F_0 \ge 2 f_0, \tag{6-27} $$

where

$$ f_0 = \frac{1}{\tau'} \quad \text{and} \quad F_0 = \frac{1}{\tau}. \tag{6-28} $$

($\tau$ is the pitch period and $\tau'$ is the window size.)
From (6-27) and (6-28) we obtain the desired relation:

$$ \tau' \ge 2\tau. \quad \text{(Hamming)} \tag{6-29} $$

(6-29) says that in order to guarantee that the signal spectrum
be very nearly equal to the impulse response spectrum for $f = nF_0$,
the window size $\tau'$ must be at least twice the pitch period. Since
we know the general range of pitch periods for human voices, it
is easy to satisfy (6-29) most of the time. As a rule of thumb,
when using a Hamming window, a window size of at least 20 msec
should give good results (this corresponds to $\tau = 10$ msec).
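
The 40 dB figure for the Hamming window can be checked with a zero-padded FFT. The following is a minimal sketch, not the procedure used in the report: the function name `worst_level_db`, the 10 kHz sampling rate, and the use of `numpy.hamming` (the usual 0.54/0.46 raised cosine) are assumptions made for illustration.

```python
import numpy as np

def worst_level_db(window, min_ratio, nfft=1 << 16):
    """Largest |W(f)|/W(0) in dB over f >= min_ratio * f0, where f0 = 1/(window length)."""
    n = len(window)
    mag = np.abs(np.fft.rfft(window, nfft))   # densely sampled spectrum
    rel = mag / mag[0]
    # Bin k corresponds to f/f0 = k * n / nfft.
    first_bin = int(np.ceil(min_ratio * nfft / n))
    return 20.0 * np.log10(rel[first_bin:].max())

n = 200                                       # 20 msec at an assumed 10 kHz rate
hamming_db = worst_level_db(np.hamming(n), 2.0)
rect_db = worst_level_db(np.ones(n), 2.0)
print(round(float(hamming_db), 1), round(float(rect_db), 1))
```

Under these assumptions the Hamming window stays more than 40 dB down for all f at or above 2 f0, while the rectangular window of the same length is less than 20 dB down over the same range.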

WINDOW        3 dB         LARGEST SIDELOBE           HIGH-FREQUENCY
              BANDWIDTH    RELATIVE TO MAIN LOBE      ROLL-OFF

RECTANGULAR   0.9 f0       -13.3 dB (1st sidelobe)    -6 dB/oct
PARZEN        1.8 f0       -53.1 dB (1st sidelobe)    -24 dB/oct
HAMMING       1.33 f0      -43 dB (4th sidelobe)      -6 dB/oct

Fig. 6-8. (a) Rectangular window.
          (b) Parzen window.
          (c) Hamming window.
          (d) Summary of spectral characteristics for the
              three windows, $f_0 = 1/\tau'$.

The same analysis can be applied to the Parzen window spectrum
in Fig. 6-9c, and we obtain the relation:

$$ \tau' \ge 3\tau, \quad \text{(Parzen)} \tag{6-30} $$

with $\epsilon = .01$ (40 dB). This means that if the Parzen window is
used, the window should equal at least three pitch periods if
very good results are desired. Conditions (6-29) and (6-30) can
be relaxed by about 20% with generally adequate results.
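
The factor of three for the Parzen window follows from where its main lobe drops 40 dB below the peak. A short sketch of that arithmetic is given here, assuming the fourth-order SINC form of (6-24) and assuming that the window size is twice the half-width parameter in that equation, so that the relative level at f = r f0 is sinc(r/4) raised to the fourth power. The function names are hypothetical.

```python
import numpy as np

def parzen_rel_level(ratio):
    """|W_4(f)| / W_4(0) at f = ratio * f0, with f0 = 1/(window size)."""
    # numpy.sinc(x) = sin(pi*x)/(pi*x); the first spectral zero is at ratio = 4.
    return np.sinc(np.asarray(ratio, dtype=float) / 4.0) ** 4

def main_lobe_crossing(eps, lo=0.0, hi=4.0, iters=60):
    """Bisection for the ratio f/f0 inside the main lobe where the level equals eps."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if parzen_rel_level(mid) > eps:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

crossing = main_lobe_crossing(0.01)   # level of -40 dB
print(round(crossing, 2))
```

The crossing lands just below 3 f0, and every sidelobe beyond the first zero is lower still, which is consistent with requiring the window to span about three pitch periods.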

Returning to the power spectrum of the rectangular window in
Fig. 6-9a, we see that (6-19) cannot apply with $\epsilon = .02$ (34 dB)
for $f/f_0 \le 10$. In fact, the best that can be achieved is an
$\epsilon = .03$ for $f/f_0 \ge 10$. This is bad for two reasons: a) $\epsilon$ is on the
high side, and therefore the approximation will be worse, and
b) $f/f_0 \ge 10$ means that $\tau' \ge 10\tau$, i.e. the window size is 10 times
the pitch period, which is far greater than the frame size that
our model allows (for good results). The best compromise is
$\epsilon = 0.1$ (20 dB) for $f/f_0 \ge 4$. But this $\epsilon$ is quite high. The
conclusion is that the rectangular window is not a good window
for pitch-asynchronous analysis.

One  conclusion  we  can  draw  from  the  above  discussion  is  that 
the  frame  width  should  be  on  the  order  of  at  least  2  pitch  periods 
if  one  is  to  obtain  good  results  with  pitch-asynchronous  analysis 
using  the  direct  method.  This  explains  why  analyzing  a  portion  of 
a  pitch  period  using  the  direct  method  is  not  recommended. 

Below we shall make use of the notion of the "effective" width
of a window. Although an actual window width is equal to $\tau'$, its
effective width is generally less than that, because the signal
samples are weighted by the window. (We are assuming here that
the area under the window is always constant and is equal to 1,
i.e. $W(0) = 1$.) It is reasonable to assume that the effective
width of a rectangular window is equal to its actual width $\tau'$.
We shall assume further that the effective width of any window is
inversely proportional to its bandwidth. From the last two
assumptions, we can define the effective width, $\tau_e$, of a window
to be equal to:

$$ \tau_e = \frac{B_1}{B}\, \tau', \tag{6-31} $$

where $B_1$ is the bandwidth of the rectangular window, and $B$ is
the bandwidth of the window whose effective width is desired.
From Fig. 6-8d we see that $B_1 = 0.9 f_0 = 0.9/\tau'$. Substituting for
$B_1$ in (6-31), we obtain:

$$ \tau_e = \frac{0.9}{B}, $$

where $B$ is measured in Hz and $\tau_e$ in sec.

For example, the bandwidth of the Hamming window from Fig. 6-8d
is $B = 1.33 f_0 = 1.33/\tau'$. From (6-31), $\tau_e = 0.68\tau'$, and the
effective width of a Hamming window is about two-thirds its actual
width. We must stress here that (6-31) is but one of many other
reasonable definitions.

We have thus far discussed methods of windowing that would
lead to good results when using the direct method. The question
now is how the direct method compares with the indirect method
in pitch-asynchronous analysis. In order to do the comparison
fairly, the "effective" frame width for both types of analysis
should be the same. We have already discussed above how to find
the effective frame width in the direct method. In many
formulations of the indirect method, the signal samples are weighted
equally, hence the effective frame width is equal to the actual
frame width. Therefore, if a Hamming window is used, for example,
on a 20 msec frame, the effective frame width is 20 x 0.68 =
13.6 msec. Therefore, the frame width corresponding to the N
samples in the indirect method should be 13.6 msec. It is
reasonable to assume that the 13.6 msec frame would be centered
within the 20 msec frame.
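
The effective-width calculation can be reproduced by measuring the 3 dB bandwidths numerically instead of reading them from a table. The sketch below is illustrative only: the function name `bandwidth_3db`, the 200-sample frame (20 msec at an assumed 10 kHz rate), and the use of `numpy.hamming` are assumptions, and the bandwidth is taken as the two-sided width of the power spectrum at half its peak value.

```python
import numpy as np

def bandwidth_3db(window, nfft=1 << 16):
    """Two-sided 3 dB bandwidth of the power spectrum, in units of f0 = 1/(window length)."""
    n = len(window)
    power = np.abs(np.fft.rfft(window, nfft)) ** 2
    power = power / power[0]
    k = int(np.argmax(power < 0.5))            # first bin below half power
    # Linear interpolation between bins k-1 and k for the half-power point.
    frac = (power[k - 1] - 0.5) / (power[k - 1] - power[k])
    half_width_bins = (k - 1) + frac
    return 2.0 * half_width_bins * n / nfft

n = 200
b_rect = bandwidth_3db(np.ones(n))
b_hamming = bandwidth_3db(np.hamming(n))
effective_fraction = b_rect / b_hamming        # ratio of bandwidths, as in (6-31)
effective_msec = 20.0 * effective_fraction
print(round(b_rect, 2), round(b_hamming, 2), round(effective_msec, 1))
```

The measured values come out close to the tabulated 0.9 f0 and 1.33 f0, and the effective width of the 20 msec Hamming frame is roughly 13.5 msec, in line with the 13.6 msec quoted above.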

Given the above basis for comparison we have found that the
direct Autocorrelation method and the indirect Covariance method
gave practically the same results for the poles of $\hat{S}(z)$ for
effective frame widths larger than a pitch period.

As a general rule of thumb, the indirect method works well
for almost any frame size, but the direct method works well only
for a frame size of at least one pitch period, with a proper choice
of window shape.

In the beginning of Section 6.2 we indicated how one might
deduce formant values from the poles of $\hat{S}(z)$ in (6-4). We mentioned
then that peak picking could be performed on the approximate
spectrum $\hat{P}(\omega)$ as a double check on the formant values. In this
section we shall discuss briefly the possibility of formant
extraction by peak picking alone, avoiding the computation necessary
to solve a p-th degree polynomial (where p is usually greater
than 10).

Most formants show up as peaks in the approximate spectrum
because they usually have a high Q (ratio of frequency to
bandwidth). However, there are cases when peaks don't show up very
well, usually because the formant has low Q, and in addition may
be close to another formant with a dominating peak. Below, we
shall discuss two methods for improving the shape of the
approximate spectrum so that peak picking will give good results for
most cases. We should point out here that peak picking has one
inherent drawback, namely that the formant values obtained are only
approximately equal to those that would be obtained by finding the
poles of $\hat{S}(z)$. This is due to the fact that the formant peak does
not occur exactly at the formant frequency. That difference
becomes smaller as the formant bandwidth decreases. In addition, the
position of a formant peak is also dependent on the positions of
neighboring formants. However, for many applications, peak picking
can give adequate accuracy for formant values.


6.251  Preemphasis 


One method that usually improves the effectiveness of peak
picking is preemphasis. We have already discussed some of the
properties of preemphasis by differencing in Section 5.5. We saw
that differencing attenuates the energy at very low frequencies
and enhances the energy at high frequencies at the rate of
approximately +6 dB/octave. The positive effect of this type of
preemphasis on peak picking is two-fold: a) attenuation of the
energy at low frequencies eliminates peaks due to the glottal
source, peaks that otherwise might be mistaken for vocal tract
formants, and b) because of the resulting increase in the
spectral slope, formants that are overshadowed by neighboring higher
amplitude formants would now appear as peaks. One disadvantage
of preemphasis is that it causes shifts in computed formant
frequencies and bandwidths. This effect is most noticeable with the
first formant. However, these shifts are not significant in
general, and can be disregarded for many applications.

We saw in Section 5.5 that preemphasis by differencing is
equivalent to introducing a zero at zero frequency (z=1) in the
signal spectrum. This zero should approximately cancel one of
the low frequency poles, and hence one less pole would be needed
in the linear prediction all-pole approximation. We have
found that if a certain value of p is optimal (in the sense given
by (6-7)) for some signal, then a value of (p-1) is optimal for
the differenced signal.
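
The two properties of differencing quoted above, the zero at z = 1 and the rise of roughly 6 dB per octave, can be seen from the frequency response of the first difference y[n] = x[n] - x[n-1]. The sketch below assumes a 10 kHz sampling rate; the function name is hypothetical.

```python
import numpy as np

def differencer_gain_db(freq_hz, fs_hz):
    """Gain in dB of y[n] = x[n] - x[n-1], i.e. 20*log10|1 - exp(-j*2*pi*f/fs)|."""
    w = 2.0 * np.pi * np.asarray(freq_hz, dtype=float) / fs_hz
    return 20.0 * np.log10(np.abs(1.0 - np.exp(-1j * w)))

fs = 10_000.0
# Gain increase over one octave at low and at high frequencies.
slope_low = differencer_gain_db(200.0, fs) - differencer_gain_db(100.0, fs)
slope_high = differencer_gain_db(4000.0, fs) - differencer_gain_db(2000.0, fs)

# The zero at z = 1 removes any constant (zero-frequency) component.
y = np.diff(np.ones(8))
print(round(float(slope_low), 2), round(float(slope_high), 2), y)
```

The slope is close to 6 dB per octave only at low frequencies and flattens toward half the sampling rate, which is why the text describes the rate as approximate.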

We shall demonstrate some of the above properties by an
example. Figure 6-10a shows the original and linear prediction
spectra for [w] in the word "anyone" [eniwAn]. The analysis was
done using the direct Autocorrelation method on a 25 msec
Hamming-windowed signal, with p=12. The corresponding analysis for the
differenced signal is shown in Fig. 6-10b with p=11 (p was
reduced by 1 according to the above discussion). The low frequency
effect due to the glottal source is evident in Fig. 6-10a but
disappears in Fig. 6-10b. The second formant does not form a
peak in Fig. 6-10a but its peak is quite clear in Fig. 6-10b.

In order to see the differences in computed formant frequencies
we refer to Fig. 6-11. Figure 6-11a shows the formant frequencies
obtained by peak picking from 256-point FFT-computed spectra
(i.e. 128 spectral values over 5 kHz). The value of the frequency
at which a peak occurred was refined by using a parabolic fit to
the three points around the peak. Figure 6-12 shows an example
of such curve fitting. Given three points around the peak, the
position of the peak can be shown to be at:

$$ x_m = \frac{1}{2}\, \frac{\Delta_1 + \Delta_2}{\Delta_1 - \Delta_2}, \tag{6-32} $$

where $\Delta_1 = y_0 - y_{-1}$ and $\Delta_2 = y_1 - y_0$, and $x_m$ is measured from
the middle point in units of the spacing between points.

Fig. 6-10. (a) Analysis of [w] in the word "anyone", using
               the Autocorrelation method. Window size is
               25 msec, p=12.
           (b) Analysis of the differenced signal, p=11.
           (Horizontal axis: frequency in kHz.)

Fig. 6-11. Formant values for the signal associated with Fig. 6-10.
           (a) Formant frequencies obtained by peak picking with
               parabolic interpolation.
           (b) Formant frequencies and bandwidths obtained from
               the poles of $\hat{S}(z)$.

Fig. 6-12. Refining of peak estimation by parabolic curve
           fitting. $(x_m, y_m)$ are the coordinates of the
           hypothesized peak.
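
The parabolic refinement of a spectral peak can be sketched in a few lines. This is an illustration under stated assumptions rather than the report's own program: the function names are hypothetical, the three samples are taken to be equally spaced, and the offset is expressed in units of the sample spacing, measured from the middle sample.

```python
import numpy as np

def parabolic_peak_offset(y_prev, y_mid, y_next):
    """Vertex of the parabola through three equally spaced points, relative to the middle one."""
    d1 = y_mid - y_prev
    d2 = y_next - y_mid
    return 0.5 * (d1 + d2) / (d1 - d2)

def refine_peak(freqs, values, k):
    """Refined peak frequency around bin k, which must be an interior local maximum."""
    spacing = freqs[k + 1] - freqs[k]
    offset = parabolic_peak_offset(values[k - 1], values[k], values[k + 1])
    return freqs[k] + spacing * offset

# 128 spectral values over 5 kHz, as in the example discussed in the text.
freqs = np.arange(128) * 5000.0 / 128.0
values = -(freqs - 1000.0) ** 2          # a test curve with its true peak at 1000 Hz
k = int(np.argmax(values))
print(freqs[k], refine_peak(freqs, values, k))
```

For a curve that is exactly parabolic the refinement recovers the true peak; for a real spectrum it is only an approximation whose quality depends on how parabolic the peak is near its top.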

The peak-picked formant frequencies shown in Fig. 6-11a are to be
compared with those obtained from the poles of $\hat{S}(z)$ and shown in
Fig. 6-11b, where the formant bandwidths are shown in addition.
These formant values are computed from the poles of $\hat{S}(z)$ as
follows:

$$ F_n = \frac{\omega_n}{2\pi}, \qquad B_n = -\frac{\sigma_n}{\pi}, \tag{6-33} $$

where $\omega_n$ and $\sigma_n$ are computed from (6-5) and (6-6). (The definition
of bandwidth in (6-33) is not exactly equivalent to the 3-dB
definition, but it gives similar results for high-Q formants.)

We note from Fig. 6-11a that a peak-picked formant frequency is
closer to the computed frequency in Fig. 6-11b when the formant
bandwidth is small, as is the case with the third formant in this
example. We also note that the largest relative change in
frequency between the formant values for the original signal and
those of the differenced signal occurs for the first formant.

Although we have not done so here, it is also possible to
estimate the formant bandwidths from the approximate spectrum by
simply measuring (with interpolation) the frequency interval between
the -3 dB points below each peak. Accurate values would result
only for high-Q formants.
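
The conversion from poles to formant frequencies and bandwidths can be sketched as follows. Equations (6-5) and (6-6) are not reproduced in this section, so the sketch assumes the usual mapping z = exp(sT) with s = sigma + j omega, and a denominator of the form 1 minus the sum of a_k z^(-k); the function names are hypothetical.

```python
import numpy as np

def poles_from_predictor(a):
    """Roots of z^p - a_1 z^(p-1) - ... - a_p (assumed denominator convention)."""
    return np.roots(np.concatenate(([1.0], -np.asarray(a, dtype=float))))

def formants_from_poles(poles, fs_hz):
    """Frequencies F = omega/(2*pi) and bandwidths B = -sigma/pi, in Hz."""
    T = 1.0 / fs_hz
    s = np.log(np.asarray(poles, dtype=complex)) / T   # s-plane poles, assuming z = exp(sT)
    upper = s[s.imag > 0]                               # one pole of each conjugate pair
    upper = upper[np.argsort(upper.imag)]
    return upper.imag / (2.0 * np.pi), -upper.real / np.pi

# Round trip on two made-up resonances (values chosen only for illustration).
fs = 10_000.0
F_true = np.array([500.0, 1500.0])
B_true = np.array([60.0, 100.0])
z = np.exp((-np.pi * B_true + 2j * np.pi * F_true) / fs)
a = -np.poly(np.concatenate((z, z.conj()))).real[1:]    # predictor coefficients
F_est, B_est = formants_from_poles(poles_from_predictor(a), fs)
print(np.round(F_est, 1), np.round(B_est, 1))
```

Real poles and poles on the negative real axis are discarded by the `s.imag > 0` test, which is a simplification; a practical formant tracker would need further rules to decide which poles are formants.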

Although the application of preemphasis to the speech signal
might improve the results of formant extraction by peak picking,
it involves a distortion of the signal (by differencing in our
case) which has some bad side effects, e.g. the normalized error
becomes useless as a voicing detector (see Section 5.5). We shall
now describe a second method that improves the results of peak
picking without affecting the signal in any way.

6.252  Off-Axis  Spectrum 

We know that formants with small bandwidths show up very well
as peaks in the approximate spectrum $\hat{P}(\omega)$ because the poles
corresponding to these formants lie close to the contour along which
the spectrum is computed (the unit circle in the z-plane or the
$j\omega$-axis in the s-plane). Therefore, for those formants whose peaks
do not show up in the spectrum, one could enhance the peaks by
moving the contour along which the spectrum is computed closer to
these formant poles. In order to see how this might be done
efficiently and effectively, we shall first define a more general
linear prediction "spectrum" $\hat{P}(\sigma,\omega)$ given by:

$$ \hat{P}(\sigma,\omega) = |\hat{S}(z)|^2, \quad z = e^{(\sigma + j\omega)T}, $$

$$ \hat{P}(\sigma,\omega) = \frac{A^2}{\left|1 - \sum_{k=1}^{p} a_k e^{-k(\sigma+j\omega)T}\right|^2} = \frac{A^2}{\left|1 - \sum_{k=1}^{p} \left(a_k e^{-k\sigma T}\right) e^{-jk\omega T}\right|^2}. \tag{6-34} $$

For $\sigma = 0$, $\hat{P}(\sigma,\omega)$ reduces to $\hat{P}(\omega)$ defined in (4-6a). If $\sigma$ is a
constant ($\sigma = \sigma_0$), then (6-34) reduces to:

$$ \hat{P}(\sigma_0,\omega) = \frac{A^2}{\left|1 - \sum_{k=1}^{p} d_k e^{-jk\omega T}\right|^2}, \tag{6-35} $$

where

$$ d_k = a_k g^k, \quad 1 \le k \le p, \tag{6-36} $$

and

$$ g = e^{-\sigma_0 T} \approx 1 - \sigma_0 T, \quad \text{for } |\sigma_0 T| \ll 1. \tag{6-37} $$

$\hat{P}(\sigma_0,\omega)$ in (6-35) has the form of a regular spectrum (see
Appendix C on how to compute such a spectrum) computed from a new
sequence of coefficients $d_k$ which are obtained by multiplying
the coefficients $a_k$ by an exponential, as shown in (6-36) and
(6-37). Since $\sigma = \sigma_0$ defines a line parallel to the $j\omega$-axis in
the s-plane, we call $\hat{P}(\sigma_0,\omega)$ an off-axis spectrum. It is
equivalent to computing the spectrum in the z-plane along a circle of
radius $r = 1/g$ concentric with the unit circle. An illustration
of the peak enhancing ability of the off-axis spectrum is presented
below.

The locations in the s-plane of the first four formants of
the original signal in Fig. 6-11b are shown in Fig. 6-13. The
off-axis spectrum for $\sigma_0 = -2\pi \times 75$ ($g = 1.048$) is shown in
Fig. 6-14. This is to be compared with the regular spectrum
shown in Fig. 6-10a. The second formant now shows up as a
definite peak in the off-axis spectrum. Also, the peaks
corresponding to $F_1$ and $F_4$ have become sharper (more peaked), while the $F_3$
peak remained about the same. Sharper peaks, of course, mean that
the new peak-picked formant frequencies are closer to the actual
formant locations.

One should be able to estimate the formant bandwidths by
adding $-\sigma_0/\pi$ to the 3 dB bandwidths of the peaks in the off-axis
spectrum. This indeed gives correct results for $F_1$, $F_2$ and $F_4$ in
this case, but not for $F_3$, because $F_3$ now lies to the right of
the $\sigma_0$-axis. For such poles, the estimated bandwidth is obtained
by subtracting the measured 3 dB bandwidth from $-\sigma_0/\pi$. Unfortunately,
there is no way to tell whether a formant lies to the right or to
the left of the $\sigma_0$-axis from the off-axis magnitude spectrum. (Note
that the same is also true for the regular spectrum, except in
that case we already know that all poles must lie to the left of
the $j\omega$-axis.) Now we see why the $F_3$ peak was about the same in
Figs. 6-10a and 6-14: $F_3$ is equally distant from the $j\omega$-axis and
the $\sigma_0$-axis, as shown in Fig. 6-13. Therefore, the
off-axis-spectrum method has the disadvantage that some bandwidth
information might be lost. However, it is easy to see that such
bandwidth information can be retained by also computing the regular
magnitude spectrum or a phase spectrum.

Fig. 6-13. Location in the s-plane of the formants shown
           in Fig. 6-11b for the original signal.

Fig. 6-14. The same linear prediction spectrum shown in
           Fig. 6-10a except that here the spectrum was
           computed inside the unit circle ($\sigma_0 = -2\pi \times 75$).

For formant peak enhancement, we wish to use a value of $\sigma_0$
which is closer to the poles of interest, on the average, than is
the $j\omega$-axis. Since we expect the first four formants to have
bandwidths in the range 0-300 Hz, a value of $\sigma_0$ corresponding to a
formant bandwidth of 150 Hz (i.e. $\sigma_0 = -2\pi \times 75$) should work well.
We have found this value to be effective.
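
The off-axis spectrum amounts to scaling the predictor coefficients by powers of g and then computing an ordinary spectrum, which can be sketched as below. The function names, the unit gain, the 10 kHz sampling rate, and the single made-up resonance used as a check are all assumptions for illustration; the denominator is taken to be 1 minus the sum of a_k z^(-k).

```python
import numpy as np

def lp_spectrum(a, gain=1.0, nfft=512):
    """gain^2 / |1 - sum_k a_k exp(-j k w T)|^2 on nfft/2 + 1 points of the unit circle."""
    denom = np.concatenate(([1.0], -np.asarray(a, dtype=float)))
    return gain ** 2 / np.abs(np.fft.rfft(denom, nfft)) ** 2

def off_axis_spectrum(a, sigma0, fs_hz, gain=1.0, nfft=512):
    """Spectrum along sigma = sigma0, obtained by scaling a_k by g^k with g = exp(-sigma0 T)."""
    g = np.exp(-sigma0 / fs_hz)
    k = np.arange(1, len(a) + 1)
    return lp_spectrum(np.asarray(a, dtype=float) * g ** k, gain, nfft)

fs = 10_000.0
sigma0 = -2.0 * np.pi * 75.0
# One made-up resonance at 1200 Hz with a 200 Hz bandwidth.
r, theta = np.exp(-np.pi * 200.0 / fs), 2.0 * np.pi * 1200.0 / fs
a = [2.0 * r * np.cos(theta), -r * r]

regular = lp_spectrum(a)
off_axis = off_axis_spectrum(a, sigma0, fs)
print(round(float(np.exp(-sigma0 / fs)), 3), off_axis.max() > regular.max())
```

Because only the coefficients change, the cost is the same as for the regular spectrum, which is the efficiency argument made in the text for a contour parallel to the j-omega axis.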

A line parallel to the $j\omega$-axis is only one of many possible
contours that would be effective in improving the results of
formant extraction by peak picking. Another possibility is to compute
the spectrum along an arbitrary straight line in the s-plane. (The
corresponding contour in the z-plane is a spiral.) Such a spectrum
can be computed using the chirp z-transform (CZT) (Rabiner, Schafer
and Rader, 1969). This type of contour makes sense in speech
analysis because, generally speaking, formant bandwidths increase
with frequency. Unfortunately, computing the CZT is quite
expensive, and it is not clear that it would be cost-effective. We
would like to point out here that the off-axis spectrum would
be a special case where the arbitrary line happens to be parallel
to the $j\omega$-axis. However, in that case, the method described in
equations (6-35) through (6-37) is much more efficient than
computing the CZT.

6.26  Comparison  with  the  Cepstral  Smoothing  Method 


Schafer and Rabiner (1970) have developed a system for
formant analysis by a peak-picking algorithm applied to a
cepstrally-smoothed spectrum (i.e. a low-pass filtered log spectrum), and in
cases where formants were believed to be very close to each other,
they applied the CZT to the cepstrum in order to enhance the
formant peaks and separate the formants. It is of interest to
compare that method to linear prediction.


First, it should be pointed out that applying the CZT to
the cepstrum corresponding to the approximate spectrum $\hat{P}(\omega)$ is
equivalent to computing $\hat{P}(\sigma,\omega)$ in (6-34) using the CZT, because
$\hat{S}(z)$ is minimum-phase (Schafer and Rabiner, 1970, Appendix B).
We have seen that the enhanced peaks in the resulting spectrum
correspond to the formant frequencies which could be obtained
more accurately by solving for the poles of $\hat{S}(z)$. Therefore,
unlike the method with a cepstrally-smoothed spectrum where the
CZT is useful in obtaining extra information about formant
locations, applying the CZT in linear prediction adds no information.

Another  point  of  comparison  is  that  both  types  of  spectra  are 
smoothed  versions  of  the  original  signal  spectrum.  One  method  does 
it  by  actually  low-pass  filtering  the  log  spectrum,  and  the  other 
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by  reducing  the  number  of  poles  of  an  all-poJe  approximate  spec¬ 
trum.  The  two  types  of  smoothing  are  not  equivalent,  however,  be¬ 
cause  in  linear  prediction  the  spectral  fitting  is  based  on  an 
all-pole  model  of  speech  which,  for  non-nasal  sonorants,  cor¬ 
responds  to  the  usual  model  of  the  vocal  tract  transfer  function. 

For  those  sounds,  we  would  expect  linear  prediction  to  give  a 
better  spectral  fit.  Figure  6-15a  shows  a  spectrum  of  a  Hamming 
weighted  25  msec  of  the  vowel  [a]  obtained  from  10  kHz  sampled 
telephone  speech,  and  superimposed  on  it  is  the  smoothed  spectrum 
obtained  by  linear  prediction  with  p  =  14.  Figure  6-15b  shows 
the  corresponding  cepstrally-smoothed  spectrum.  (The  cepstrum  has 
unity  weighting  up  to  1.5  msec  and  cosine  weighting  up  to  3.0 
msec.)  Note  that  a  simple  peak  picking  algorithm  in  Fig.  6-15b 
would  result  in  a  false  third  formant  at  2  kHz.  Because  we  know 
the  spectral  characteristics  of  the  vowel  [a] ,  the  third  formant 
is  more  likely  at  2.8  kHz  as  shown  in  Fig.  6-15a. 
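The cepstral smoothing used for Fig. 6-15b can be sketched as follows. This is a minimal illustration, not the implementation behind the figure: the names `dft`, `log_power_spectrum` and `cepstral_smooth` are hypothetical, a plain O(N^2) DFT stands in for an FFT, and the raised-cosine taper between 1.5 and 3.0 msec is only an assumed reading of "cosine weighting".

```python
import cmath
import math

def dft(x, inverse=False):
    """Plain O(N^2) discrete Fourier transform (an FFT would be used in practice)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[m] * cmath.exp(sign * 2j * math.pi * k * m / n) for m in range(n))
        out.append(acc / n if inverse else acc)
    return out

def log_power_spectrum(frame):
    """Log power spectrum of a (windowed) frame; the small floor avoids log(0)."""
    return [math.log(abs(v) ** 2 + 1e-12) for v in dft(frame)]

def cepstral_smooth(frame, fs, flat_ms=1.5, taper_ms=3.0):
    """Low-pass filter the log spectrum by weighting the cepstrum.

    Unity weight up to flat_ms, an assumed raised-cosine taper down to
    zero at taper_ms, and zero weight beyond that.
    """
    n = len(frame)
    ceps = [c.real for c in dft(log_power_spectrum(frame), inverse=True)]
    q_flat = flat_ms * 1e-3 * fs      # quefrency limits expressed in samples
    q_taper = taper_ms * 1e-3 * fs
    weighted = []
    for q in range(n):
        d = min(q, n - q)             # the cepstrum of a real frame is even
        if d <= q_flat:
            w = 1.0
        elif d < q_taper:
            w = 0.5 * (1.0 + math.cos(math.pi * (d - q_flat) / (q_taper - q_flat)))
        else:
            w = 0.0
        weighted.append(w * ceps[q])
    return [v.real for v in dft(weighted)]
```

With `fs = 10000`, the default limits correspond to 15 and 30 cepstral samples. A peak picker would then be run on the returned smoothed log spectrum.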

High-pitched speech normally gives rise to problems in
formant tracking due to the fact that for voiced sounds the spectral
harmonics are widely separated. We have seen in Section 6.2 that
this results in a basic loss of information about the formant
structure, a loss that cannot be recovered even by pitch-synchronous
analysis, unless new information is added. We have also suggested
that the method of linear prediction should perform quite well
(with nonnasal sonorants) because of the fact that we assume an
all-pole model, which amounts to additional information. In
cepstral smoothing the cut-off point of the low-pass filter is
placed below the pitch peak, which for high-pitched speech can
mean a further loss of information about the formant structure.

In linear prediction, the formant locations are less affected by
the  pitch  because  the  harmonics  are  forced  to  fit  the  all-pole 
model.  This  is  a  well-known  property  of  analysis-by-synthesis 
methods.  (Mermelstein  (1967)  has  suggested  a  method  for  smooth¬ 
ing  the  spectrum  by  subtracting  an  approximation  to  the  effects 
of  the  fine  structure  from  the  spectrum,  thus  bypassing  the 
problems  that  arise  from  low-pass  filtering  the  spectrum.) 

Although for nonnasal sonorants linear prediction is expected
to give more accurate formant values than the cepstral smoothing
method, the same is not necessarily true for other sounds
such as nasals and fricatives, whose spectra are known to have
antiformants as well as formants. The problems involved have
been discussed in Section 6.21.

CHAPTER VII
CONCLUSIONS 


Linear  prediction  is  an  autocorrelation-domain  analysis. 
Therefore,  it  can  be  approached  either  from  the  time  or  frequency 
domain.  Although  the  actual  computations  are  performed  in  the 
time  domain,  we  chose  to  derive  the  most  general  formulations  for 
linear  prediction  from  the  frequency  domain  because  of  the  domi¬ 
nance  of  spectral  analysis  in  speech  research.  We  have  shown 
that  all  least-squares  methods  of  linear  prediction  can  be  derived 
from  a  single  general  concept,  namely  that  of  generalized  analysis- 
by-synthesis.  Here  the  2D-spectrum  (two  dimensional  spectrum)  of 
a  nonstationary  signal  (such  as  speech)  is  to  be  approximated  by 
another  2D-spectrum,  where  the  error  to  be  minimized  is  proportional 
to  the  integral  of  the  ratio  of  the  signal  spectrum  to  the  approxi¬ 
mate  spectrum.  This  error  criterion  was  shown  to  be  very  desirable 
for  a  good  spectral  envelope  fit.  In  the  special  case  when  the 
approximate  spectrum  consists  of  poles  only,  the  generalized 
method  reduces  to  the  general  Covariance  method  of  linear  predic¬ 
tion.  If,  in  addition,  the  signal  is  assumed  to  be  stationary, 
the  2D-spectrum  is  replaced  by  the  ordinary  power  spectrum,  and 
the  Covariance  method  reduces  to  the  Autocorrelation  method  of 
linear  prediction. 




The linear prediction speech production model assumes the
vocal tract to be fixed in shape within a portion of the speech
signal (a frame) on the order of 10-25 msec. Within each frame,
the speech signal is assumed to be nonstationary in the Covariance
method and stationary in the Autocorrelation method. In
general, the assumption of nonstationarity is a better assumption
for speech signals. However, within a frame, the speech signal
can be considered to be quasi-stationary, so the assumption of
stationarity in the Autocorrelation method is not a bad one. In
general, one would expect the Covariance method to give better
results than the Autocorrelation method, especially with analysis-
synthesis systems. However, for other speech applications, the
advantages of one method over the other do not seem to be
significant.




In  computing  the  predictor  coefficients  from  a  single  frame 
we  defined  two  basic  methods:  the  direct  and  indirect  methods. 

In  the  direct  method,  the  signal  is  weighted  by  a  window  that  is 
zero  outside  the  frame,  and  the  resulting  signal  is  considered  to 
be  infinite.  In  the  indirect  method,  an  unwindowed  finite  portion 
of  the  signal  is  used.  The  most  popular  and  useful  methods  are 
the  direct  Autocorrelation  and  indirect  Covariance  methods.  As  a 
general  rule  of  thumb,  the  indirect  method  works  well  for  almost 
any  frame  size,  but  the  direct  method  works  well  only  for  a  frame 
size  of  at  least  one  pitch  period,  with  a  proper  choice  of  window 
shape.  We  have  developed  criteria  that  a  window  function  must 
satisfy in order to give good results.

The direct Autocorrelation method was discussed in detail
because, with this method, it was possible to examine in what
manner the all-pole linear prediction spectrum approximated the
signal spectrum. For example, from the normalized error curve
it was possible to set general guidelines to help determine the
number of poles in the linear prediction spectrum that would best
approximate the envelope of the signal spectrum. As the number
of poles approached infinity, the linear prediction spectrum
became identical to the signal spectrum, while the linear
prediction transfer function became the minimum-phase counterpart
to the signal transfer function. Several methods were suggested
for computing the minimum-phase sequence corresponding to the
original signal.

The study of the normalized error in the direct Autocorrelation
method led to some interesting and important results. First,
we showed that the normalized error was equal to the ratio of
the geometric mean of the linear prediction spectrum to its
arithmetic mean. The arithmetic mean of the spectrum is equal
to the energy in the signal, while the geometric mean is equal
to the exponential of the zero quefrency component, $c_0$, of the
cepstrum. Thus, the normalized error measure is a form of
normalization of $c_0$ with respect to the energy in the signal, and
the resulting ratio is a function of only the shape of the
spectrum. The properties of the normalized error are a reflection
of the properties of $c_0$. One such property is the usefulness of
the normalized error in the detection of voicing. It was shown
that such usefulness depended completely on the spectral shapes
of the sounds. Any processing of the signal that changed its
spectral characteristics was seen to have a possible detrimental
effect on the usefulness of the normalized error as a voicing
detector. Speech preemphasized by differencing, and telephone
speech, were given as examples of such processing. Under these
circumstances, it was suggested that the first autocorrelation
coefficient would be a better voicing detector.
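The relation between the normalized error and the two spectral means can be checked numerically. The sketch below assumes an illustrative two-pole model (pole radius 0.9, angle 0.3 rad, unit gain; none of these values come from the report) and hypothetical helper names. For the signal generated by the model itself, the minimum error equals the squared gain, so the normalized error is the squared gain divided by the signal energy $R_0$.

```python
import cmath
import math

def lp_spectrum(a, gain_sq, m):
    """Sample gain_sq / |1 - sum_k a_k e^{-jkw}|^2 at m uniformly spaced frequencies."""
    spec = []
    for i in range(m):
        w = 2.0 * math.pi * i / m
        h = 1.0 - sum(ak * cmath.exp(-1j * (k + 1) * w) for k, ak in enumerate(a))
        spec.append(gain_sq / abs(h) ** 2)
    return spec

def impulse_response_energy(a, gain_sq, n=600):
    """Energy R_0 of the all-pole model's impulse response, truncated at n samples."""
    s = []
    for i in range(n):
        v = 1.0 if i == 0 else 0.0
        v += sum(ak * s[i - 1 - k] for k, ak in enumerate(a) if i - 1 - k >= 0)
        s.append(v)
    return gain_sq * sum(v * v for v in s)

radius, theta = 0.9, 0.3                       # assumed pole pair
a = [2.0 * radius * math.cos(theta), -radius ** 2]
gain_sq = 1.0

spec = lp_spectrum(a, gain_sq, 1024)
arithmetic_mean = sum(spec) / len(spec)
geometric_mean = math.exp(sum(math.log(v) for v in spec) / len(spec))
normalized_error = gain_sq / impulse_response_energy(a, gain_sq)
```

Under these assumptions `geometric_mean / arithmetic_mean` agrees with `normalized_error`, and `arithmetic_mean` agrees with the signal energy, because only the shape of the spectrum enters the ratio.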

Filtering  the  speech  signal  by  the  linear  prediction  inverse 
filter  results  in  an  error  signal.  For  voiced  sounds,  this  error 
signal  often  shows  distinct  pulses  at  the  start  of  each  pitch 
period.  These  "pitch  pulses"  can  be  used  for  pitch  extraction. 

In  cases  where  the  signal  is  not  rich  in  harmonics,  e.g.  during 
sonorant-nonsonorant  transitions  and  for  voicing  of  stops  and 
fricatives,  pitch  pulses  are  likely  not  to  be  prominent,  and 
therefore  pitch  would  have  to  be  estimated  by  some  other  means, 
such  as  peak  picking  of  the  speech  signal  itself. 
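A small synthetic experiment illustrates how pitch pulses appear in the error signal. Everything in this sketch is an assumption made for illustration: the "voiced" signal is an impulse train with a period of 80 samples passed through a single two-pole resonator, the predictor has only two coefficients, and the function names are hypothetical.

```python
import math

def resonator_pulse_train(n, period, radius, theta):
    """Impulse train (one pulse every `period` samples) through a two-pole resonator."""
    c1, c2 = 2.0 * radius * math.cos(theta), -radius ** 2
    s = []
    for i in range(n):
        v = 1.0 if i % period == 0 else 0.0
        v += c1 * (s[i - 1] if i >= 1 else 0.0) + c2 * (s[i - 2] if i >= 2 else 0.0)
        s.append(v)
    return s

def autocorrelation(s, lag):
    """Short-time autocorrelation of the whole (finite) signal at one lag."""
    return sum(s[i] * s[i + lag] for i in range(len(s) - lag))

def predictor_order2(s):
    """Solve the 2x2 autocorrelation normal equations for a_1 and a_2."""
    r0, r1, r2 = (autocorrelation(s, k) for k in range(3))
    det = r0 * r0 - r1 * r1
    return r1 * (r0 - r2) / det, (r0 * r2 - r1 * r1) / det

def inverse_filter(s, a1, a2):
    """Error signal e(n) = s(n) - a_1 s(n-1) - a_2 s(n-2)."""
    return [s[i] - a1 * (s[i - 1] if i >= 1 else 0.0) - a2 * (s[i - 2] if i >= 2 else 0.0)
            for i in range(len(s))]

signal = resonator_pulse_train(400, 80, 0.9, 0.6)   # assumed synthetic signal
a1, a2 = predictor_order2(signal)
error = inverse_filter(signal, a1, a2)

# Indices of the five largest error magnitudes, in time order.
pitch_pulses = sorted(sorted(range(len(error)), key=lambda i: -abs(error[i]))[:5])
```

For this idealized signal the largest error samples fall at the start of each period. Real speech is far less clean, which is why the text recommends other means when the signal is not rich in harmonics.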

Another application of linear prediction is in the estimation
of formants of the vocal tract. These formants are estimated
from the poles of the linear prediction transfer function. We
discussed several factors that influence the extent to which
extracted formant values correspond to actual resonances of the
vocal tract. We concluded that formant extraction by linear
prediction works well with nonnasal sonorants. However, if the
transfer function of the vocal tract contains antiresonances as well as
resonances, as is the case for nasals and fricatives, then linear
prediction is inadequate for the extraction of the formants and
antiformants.

Because computing the poles of the linear prediction transfer
function is expensive, we discussed formant tracking by peak
picking of the linear prediction spectrum as an alternate
inexpensive method. Unfortunately, not all formants are represented
by peaks in the spectrum. Two methods were discussed to render
peak picking more effective. The first method involves preprocessing
the speech signal by preemphasis. Preemphasis by differencing
was seen to be effective, except that it had some undesirable side
effects, such as shifts in formant positions, especially the first
formant. The second method did not involve any preprocessing of
the signal. One merely computes the linear prediction spectrum
along a circle inside the unit circle (which corresponds to a
line parallel and to the left of the $j\omega$-axis). The resulting
"off-axis spectrum" has proven to be both efficient and effective.
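The off-axis spectrum can be sketched directly from its description: the all-pole spectrum is evaluated on $z = r e^{j\omega T}$ with $r < 1$ instead of on the unit circle. The function name, the pole values and the radius 0.95 are illustrative assumptions, and $r$ must stay larger than the magnitude of every pole.

```python
import cmath
import math

def lp_spectrum_on_circle(a, gain_sq, omega_t, r=1.0):
    """Evaluate gain_sq / |H(z)|^2 at z = r*exp(j*omega_t), H(z) = 1 - sum_k a_k z^-k."""
    z = r * cmath.exp(1j * omega_t)
    h = 1.0 - sum(ak * z ** -(k + 1) for k, ak in enumerate(a))
    return gain_sq / abs(h) ** 2

radius, theta = 0.9, 0.8                       # assumed pole pair
a = [2.0 * radius * math.cos(theta), -radius ** 2]

on_axis = lp_spectrum_on_circle(a, 1.0, theta)            # on the unit circle
off_axis = lp_spectrum_on_circle(a, 1.0, theta, r=0.95)   # on a circle inside it
```

Moving the evaluation circle toward the pole raises the peak at the pole frequency (here by a factor of about 3.4), which is what makes closely spaced or weak formant peaks easier to pick.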

One issue of importance to most types of speech analysis is
the choice of frame width and position. This issue was discussed
in terms of pitch-synchronous and pitch-asynchronous analysis.
The latter type of analysis included a detailed discussion of
windowing.

APPENDIX A

ON  THE  Z-TRANSFORM  AND  FOURIER  SERIES 

In  this  appendix  we  shall  define  the  z-transform,  its  in¬ 
verse,  and  their  relation  to  Fourier  series  and  the  Laplace  trans¬ 
form. 

A.1 Definition and Properties of z-Transforms

Given  a  sampled  sequence  x(nT),  defined  for  all  n,  where  n 
is  an  integer  and  T  is  the  sampling  interval,  the  z-transform  of 
x(nT)  is  defined  as: 

$$X(z) = \sum_{n=-\infty}^{\infty} x(nT)\, z^{-n} \qquad \text{(A-1)}$$


The  operator  z  is,  in  general,  complex  and  is  defined  in  terms  of 
the  Laplace  operator  s  as  follows: 


$$z = e^{sT} = e^{(\sigma + j\omega)T} \qquad \text{(A-2)}$$

where $\omega = 2\pi f$ is the radian frequency in rad/sec,
$\sigma$ is the damping factor in rad/sec,
$T = 1/f_s$ is the sampling interval in seconds, and
$f_s$ is the sampling frequency in Hz.


x(nT)  could  in  general  be  complex  but  is  often  real  in  actual 
applications. 


The inverse z-transform of $X(z)$ is then $x(nT)$ and can be
shown to be equal to (Gold and Rader, 1969, pp. 26-27):

$$x(nT) = \frac{1}{2\pi j} \oint X(z)\, z^{n-1}\, dz, \qquad \text{(A-3)}$$

where the closed path of integration lies in the region of convergence
of $X(z)$ and encircles the origin.

The relation between $s$ and $z$ in (A-2) defines a mapping
between the s-plane and the z-plane. It is very important to
understand the nature of this mapping for a thorough understanding
of z-transforms. The s-plane shown in Fig. A-1a has been
divided by horizontal dashed lines into strips of width $\omega_s = 2\pi f_s$.
There are, of course, an infinite number of these strips in the
s-plane. According to (A-2), each strip of width $2\pi f_s$, as shown
in Fig. A-1, maps into the entire z-plane. Therefore, the mapping
from the s-plane to the z-plane is an infinity-to-one mapping.
For a particular configuration in the z-plane (see Fig. A-1b), the
s-plane consists of an infinity of repeating strips of identical
configurations. Each pole (or zero) in the z-plane maps into an
infinite number of poles (or zeros) in the s-plane separated by
$\omega_s = 2\pi f_s$. This is shown in Fig. A-1 for the poles $a$, $b$, $\bar{b}$, and $c$,
where the over-bar denotes complex conjugate. As can be seen
from (A-2) and Fig. A-1, the $j\omega$-axis ($\sigma = 0$) maps into the unit
circle $z = e^{j\omega T}$ in the z-plane. The left half of the s-plane maps
into the region inside the unit circle, while the right half of
the s-plane maps into the region outside the unit circle. A
vertical line at $\sigma = \sigma_0$ in the s-plane maps into a circle defined
by $z = e^{\sigma_0 T} e^{j\omega T}$. A horizontal line at $\omega = \omega_0$, as well as lines at
$\omega_0 \pm 2\pi k/T$ in the s-plane, map into a radial half-line emanating
from the origin of the z-plane and defined by $z = e^{\sigma T} e^{j\omega_0 T}$.
In particular, the real axis ($\omega = 0$) in the s-plane maps into the
positive real half-line ($z$ real and $> 0$) in the z-plane. The
negative real half-line ($z$ real and $< 0$) of the z-plane maps into
horizontal lines at $\omega = (2k+1)\pi/T$ in the s-plane. These horizontal
lines form the boundary lines between strips in the s-plane.

This latter mapping is quite unique in the context of z to s
mapping. This can be seen by examining how the poles in the z-plane
shown in Fig. A-1b map into corresponding poles in the s-plane.
Also, we shall concentrate on the center strip in the s-plane
ranging from $-\pi/T$ to $\pi/T$. The positive real-axis pole $a$ in the
z-plane maps into a real-axis pole in the s-plane. The complex
poles $b$ and $\bar{b}$ in the z-plane map into corresponding complex
poles in the s-plane. However, the negative real pole $c$ in the
z-plane maps into complex poles in the s-plane. Figure A-2c shows
a single period of the amplitude frequency response for a single
negative real pole in the z-plane ($z_c = -0.6$). Compare that with
Fig. A-2a for a positive real pole ($z_a = 1.7$), and with Fig. A-2b
for a complex conjugate pair of poles ($z_b = 0.4(1+j\sqrt{3})$, $\bar{z}_b = 0.4(1-j\sqrt{3})$).
Also, compare the digital frequency response in each case with
the corresponding analog (s-plane) response which is the response
of the poles that are in the center strip in Fig. A-1a.
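The infinity-to-one character of the mapping in (A-2) is easy to confirm numerically. The sketch below is illustrative only: the function names, the sampling interval and the test point are assumptions, and the inverse mapping simply takes the principal value of the logarithm, which selects the center strip.

```python
import cmath
import math

def s_to_z(s, T):
    """z = exp(sT), the mapping of (A-2)."""
    return cmath.exp(s * T)

def z_to_center_strip(z, T):
    """Inverse mapping into the center strip, -pi/T < omega <= pi/T."""
    return cmath.log(z) / T      # principal value of the logarithm

T = 1.0e-4                                     # assumed 10 kHz sampling
s0 = complex(-300.0, 2.0 * math.pi * 1500.0)   # a left-half-plane point
strip = 2.0 * math.pi / T                      # strip width, 2*pi*f_s

# Points a whole number of strips apart share one image in the z-plane.
images = [s_to_z(s0 + 1j * k * strip, T) for k in (-2, -1, 0, 1, 2)]

# Going back from z selects the representative in the center strip.
recovered = z_to_center_strip(images[0], T)

# A strip boundary, omega = pi/T, lands on the negative real axis of the z-plane.
boundary = s_to_z(complex(-300.0, math.pi / T), T)
```

When `z` is a pole of an all-pole model, the same inverse mapping is what turns it into a resonance: `recovered.imag / (2 * math.pi)` is its frequency in Hz (1500 Hz here), and `-recovered.real / math.pi` is the usual single-resonance estimate of its bandwidth in Hz.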

A.2 z-Transform and Fourier Series

In order to relate z-transforms to Fourier series we let
$\sigma = 0$ in (A-2), resulting in $z = e^{j\omega T}$. Substituting for $z$ in (A-1)
we obtain:

$$X(\omega) = \sum_{n=-\infty}^{\infty} x(nT)\, e^{-jn\omega T} \qquad \text{(A-4)}$$

where $X(\omega)$ stands for $X(e^{j\omega T})$.

The inverse transform of $X(\omega)$ is obtained by substituting
$z = e^{j\omega T}$ in (A-3) and taking the path of integration around
the unit circle. The result can be easily shown to be:

$$x(nT) = \frac{T}{2\pi} \int_{-\pi/T}^{\pi/T} X(\omega)\, e^{jn\omega T}\, d\omega. \qquad \text{(A-5)}$$
Equations  (A-4)  and  (A-5)  can  be  viewed  simply  as  the  ordinary 
Fourier  series  transform  pair,  but  with  time  and  frequency  inter¬ 
changed.  In  traditional  Fourier  series  analysis  the  time  function 
is  normally  continuous  and  periodic  while  the  frequency  domain  is 
discrete  (i.e.,  the  transform  exists  only  at  multiples  of  the  fun¬ 
damental);  in  other  words,  the  frequency  function  is  sampled.  On 
the other hand, in z-transform analysis, the time function is
sampled while the frequency function is continuous and periodic.
Therefore,  we  can  make  the  general  assertion  that  sampling  in  one 
domain  corresponds  to  periodicity  in  the  transform  domain.  We 
have,  as  a  corollary,  that  if  a  function  in  one  domain  is  both 
sampled  and  periodic,  then  the  transform  function  must  also  be 
both  sampled  and  periodic.  Another  way  of  stating  this  is  that 
if  a  time  function  is  sampled  and  its  frequency  transform  is  also 
sampled,  then  both  functions  must  also  be  periodic.  Indeed,  this 
is one of the principal properties of the discrete Fourier
transform (Gold and Rader, 1969, Ch. 6).

We have seen above that the z-transform with $\sigma = 0$ reduces to
the Fourier series transform. We also know that the Laplace
transform with $\sigma = 0$ reduces to the Fourier integral transform.
Therefore, we can say that the z-transform is to Fourier series what the
Laplace transform is to Fourier integrals. This analogy can be
very useful in understanding the workings of the z-transform.

We shall give one example where the result is obtained by
analogy to Fourier series. Consider a continuous and periodic
function of time $x(t)$ with period $T$, having a transform in the
frequency domain $X(n/T)$. Then, the energy in one period of the
signal can be obtained from the time domain as well as the
frequency domain as follows:

$$\text{Energy} = \frac{1}{T} \int_{-T/2}^{T/2} |x(t)|^2\, dt = \sum_{n=-\infty}^{\infty} \left| X\!\left(\tfrac{n}{T}\right) \right|^2. \qquad \text{(A-6)}$$

This is a special case of Parseval's theorem (Lee, 1960, p. 11).
Now, by carefully interchanging time and frequency in (A-6) we
have:

$$\text{Energy} = \frac{T}{2\pi} \int_{-\pi/T}^{\pi/T} |X(\omega)|^2\, d\omega = \sum_{n=-\infty}^{\infty} |x(nT)|^2. \qquad \text{(A-7)}$$
This  says  that  the  total  energy  in  a  sampled  signal  x(nT)  can  be 
obtained  by  integrating  over  a  period  of  the  power  spectrum. 
Equation  (A-7)  can  be,  of  course,  also  derived  directly  from  (A-4) 
and  (A-5) ,  but  we  wanted  to  demonstrate  how  one  might  use  the 
analogy  with  Fourier  series. 
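Equation (A-7) can also be confirmed numerically for a short sequence. In this sketch the sample values, the sampling interval and the number of integration points are arbitrary assumptions, the sequence is taken to start at $n = 0$, and the function name `spectrum` is hypothetical.

```python
import cmath
import math

def spectrum(x, omega, T):
    """X(omega) = sum_n x(nT) exp(-j n omega T), as in (A-4), for a finite sequence."""
    return sum(xn * cmath.exp(-1j * n * omega * T) for n, xn in enumerate(x))

x = [1.0, -0.5, 0.25, 2.0]      # illustrative samples x(nT), zero elsewhere
T = 1.0e-4                      # assumed sampling interval in seconds
M = 64                          # integration points over one period of X(omega)

# Rectangle-rule integral of |X(omega)|^2 from -pi/T to pi/T.
d_omega = 2.0 * math.pi / (M * T)
integral = sum(abs(spectrum(x, -math.pi / T + i * d_omega, T)) ** 2
               for i in range(M)) * d_omega

energy_frequency = T / (2.0 * math.pi) * integral
energy_time = sum(v * v for v in x)
```

Both `energy_frequency` and `energy_time` come out as 5.3125. For a finite sequence the rectangle rule is exact as soon as the number of points exceeds the sequence length, which is the sampled-and-periodic property discussed above.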
APPENDIX B


THE  AUTOCORRELATION  METHOD 
AND  ORTHOGONAL  POLYNOMIALS 


The  inverse  filter  H(z)  defined  in  (2-3)  is  a  function  of  p, 
the  number  of  predictor  coefficients.  Here  we  shall  make  this 


dependence  explicit  by  writing: 


$$H_p(z) = 1 - \sum_{k=1}^{p} a_k z^{-k} \qquad \text{(B-1)}$$


In this appendix we shall use the results of Grenander and Szego
(1958) to show that $H_0(z), H_1(z), \ldots, H_p(z), \ldots$ form a unique
set of polynomials that is orthogonal on the unit circle with
respect to the signal power spectrum $P(\omega)$. This will lead us to
certain properties of $H_p(z)$, and to a derivation of the solution
to the autocorrelation normal equations (3-17). We call this
solution the Fast Autocorrelation method.

B.1 Orthogonal Polynomials on the Unit Circle

Let $P(x)$ be a nonnegative and Lebesgue-integrable function,

$$P(x) \ge 0, \quad \text{all } x, \qquad \text{(B-2a)}$$

$$\int_{-\pi}^{\pi} P(x)\, dx < C, \qquad \text{(B-2b)}$$

where $C$ is some finite constant.

Also, let the inverse Fourier transform of $P(x)$ be given by:

$$R_k = \frac{1}{2\pi} \int_{-\pi}^{\pi} P(x)\, e^{jkx}\, dx. \qquad \text{(B-3)}$$


We form a system of polynomials

$$\phi_0(z),\ \phi_1(z),\ \ldots,\ \phi_n(z),\ \ldots$$

of the complex variable $z$ which are orthonormal on the unit circle
$z = e^{jx}$, with the weight $\frac{1}{2\pi} P(x)$. These polynomials satisfy the
following two conditions:

(i) $\phi_n(z)$ is a polynomial of degree $n$ in which the
coefficient of $z^n$ is real and positive;

(ii) the inner product $(\phi_n(z), \phi_m(z))$ with respect to $P(x)$
is given by:

$$(\phi_n(z), \phi_m(z)) = \frac{1}{2\pi} \int_{-\pi}^{\pi} \phi_n(z)\, \overline{\phi_m(z)}\, P(x)\, dx = \delta_{nm}, \quad z = e^{jx};\ n, m = 0, 1, 2, \ldots \qquad \text{(B-4)}$$

where the over-bar denotes complex conjugate.

Grenander and Szego (1958, pp. 12-14, 35-42) have shown that the
set of polynomials $\{\phi_n(z)\}$ is uniquely determined by conditions
(i) and (ii).

Each polynomial $\phi_n(z)$ is given by:

$$\phi_n(z) = (D_{n-1} D_n)^{-1/2}
\begin{vmatrix}
R_0 & R_1 & R_2 & \cdots & R_{n-1} & R_n \\
R_{-1} & R_0 & R_1 & \cdots & R_{n-2} & R_{n-1} \\
R_{-2} & R_{-1} & R_0 & \cdots & R_{n-3} & R_{n-2} \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
R_{1-n} & R_{2-n} & R_{3-n} & \cdots & R_0 & R_1 \\
1 & z & z^2 & \cdots & z^{n-1} & z^n
\end{vmatrix} \qquad \text{(B-5)}$$

where

$$D_n = \det(R_{j-i})_0^n =
\begin{vmatrix}
R_0 & R_1 & \cdots & R_n \\
R_{-1} & R_0 & \cdots & R_{n-1} \\
\vdots & \vdots & & \vdots \\
R_{-n} & R_{1-n} & \cdots & R_0
\end{vmatrix}. \qquad \text{(B-6)}$$


If we let

$$\phi_n(z) = k_n z^n + \cdots + \ell_n, \qquad \text{(B-7)}$$

where $k_n$ is the coefficient of $z^n$ and $\ell_n$ is the constant term,
then the polynomial $\phi_n(z)$ is shown to obey the recurrence relation:

$$k_n\, \phi_{n+1}(z) = k_{n+1}\, z\, \phi_n(z) + \ell_{n+1}\, z^n\, \phi_n(z^{-1}), \qquad \text{(B-8)}$$

where we have assumed that the coefficients of $\phi_n(z)$ are real.
From (B-8) one can compute $\phi_n(z)$ recursively given the following
additional properties:

$$\phi_0(z) = R_0^{-1/2}, \qquad \text{(B-9)}$$

$$k_n^2 = \sum_{i=0}^{n} \ell_i^2. \qquad \text{(B-10)}$$


B.2  Application  to  Linear  Prediction 


If we let $x = \omega T$ in $P(x)$, and let $P(\omega)$ be the power spectrum
of a signal with finite energy, then conditions (B-2) are satisfied
and $R_k$ are the autocorrelation coefficients, which are real
and even. From (B-5) we see that $\phi_n(z)$ must have real coefficients.
Furthermore, by comparing (B-5) and (3-17), the autocorrelation
normal equations, it can be shown that:

$$\phi_n(z) = \frac{z^n}{A_n}\, H_n(z), \qquad \text{(B-11)}$$

where $H_n(z)$ is the inverse filter defined in (B-1) and $A_n$ is the
gain factor defined in (2-3) and given by (3-37):

$$A_n^2 = E_n = R_0 - \sum_{k=1}^{n} a_k R_k. \qquad \text{(B-12)}$$

$A_n^2$ is equal to the minimum total-squared error $E_n$. From (B-1),
(B-11) and (B-7) it is clear that:

$$A_n = \frac{1}{k_n}. \qquad \text{(B-13)}$$




Report  No.  2304 


Bolt  Beranek  and  Newman  Inc. 


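Substituting (B-13) in (B-11) and the result in (B-8) we have:

$$k_n k_{n+1}\, z^{n+1} H_{n+1}(z) = k_n k_{n+1}\, z^{n+1} H_n(z) + \ell_{n+1} k_n\, H_n(z^{-1}). \qquad \text{(B-14)}$$

Dividing (B-14) by $(k_n k_{n+1} z^{n+1})$ we obtain the recurrence relation:

$$H_{n+1}(z) = H_n(z) + K_n\, z^{-(n+1)} H_n(z^{-1}), \qquad \text{(B-15)}$$

where

$$K_n = \frac{\ell_{n+1}}{k_{n+1}}. \qquad \text{(B-16)}$$

From (B-10) we have:

$$k_{n+1}^2 = k_n^2 + \ell_{n+1}^2$$

or

$$k_n^2 = k_{n+1}^2 - \ell_{n+1}^2 = k_{n+1}^2 \left( 1 - \frac{\ell_{n+1}^2}{k_{n+1}^2} \right). \qquad \text{(B-17)}$$

Substituting (B-16) and (B-13) in (B-17) we obtain a recurrence
relation for $A_n$:

$$A_{n+1}^2 = A_n^2 \left( 1 - K_n^2 \right). \qquad \text{(B-18)}$$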


We now show how to compute $K_n$ (Markel and Gray, to be published).
Take the inner product of $H_{n+1}(z)$ in (B-15) with $z^{-(n+1)}$. (The
definition of the inner product of two polynomials is given by
the left-hand side of (B-4).)

$$\begin{aligned}
\left( H_{n+1}(z),\, z^{-(n+1)} \right)
&= \frac{1}{2\pi} \int_{-\pi}^{\pi} H_{n+1}(\omega)\, e^{j(n+1)\omega T}\, P(\omega)\, d(\omega T) \\
&= \frac{1}{2\pi} \int_{-\pi}^{\pi} \left[ 1 - \sum_{k=1}^{n+1} a_k e^{-jk\omega T} \right] e^{j(n+1)\omega T}\, P(\omega)\, d(\omega T) \\
&= R_{n+1} - \sum_{k=1}^{n+1} a_k R_{n+1-k}.
\end{aligned} \qquad \text{(B-19)}$$

If we let $i = p = n+1$ in the autocorrelation normal equations (3-15),
then (B-19) is equal to zero:

$$\left( H_{n+1}(z),\, z^{-(n+1)} \right) = 0. \qquad \text{(B-20)}$$

Therefore, from (B-15) and (B-20) we have:

$$K_n = - \frac{\left( H_n(z),\, z^{-(n+1)} \right)}{\left( z^{-(n+1)} H_n(z^{-1}),\, z^{-(n+1)} \right)}. \qquad \text{(B-21)}$$

By derivations similar to that given above, and making use of
(B-12), it can be shown that:

$$K_n = - \frac{R_{n+1} - \sum_{k=1}^{n} a_k^{(n)} R_{n+1-k}}{E_n}, \qquad \text{(B-22)}$$

where $a_k^{(n)}$ are the predictor coefficients corresponding to $H_n(z)$.

Equations (B-15), (B-18) and (B-22), in addition to the initial
conditions

$$H_0(z) = 1, \qquad A_0^2 = R_0, \qquad \text{(B-23)}$$

give a complete recursive solution for the polynomials $H_n(z)$, and
hence a solution to the autocorrelation normal equations (3-17).

Equation (B-15) can be expressed as a recurrence relation
in terms of the predictor coefficients $a_k$. Substituting from
(B-1) in (B-15) we have:

$$\begin{aligned}
1 - \sum_{k=1}^{n+1} a_k^{(n+1)} z^{-k}
&= 1 - \sum_{k=1}^{n} a_k^{(n)} z^{-k} + K_n z^{-(n+1)} \left[ 1 - \sum_{k=1}^{n} a_k^{(n)} z^{k} \right] \\
&= 1 - \left[ \sum_{k=1}^{n} a_k^{(n)} z^{-k} + K_n \sum_{k=1}^{n} a_{n+1-k}^{(n)} z^{-k} - K_n z^{-(n+1)} \right].
\end{aligned} \qquad \text{(B-24)}$$

By equating the coefficients of equal power of $z$ on both sides of
(B-24), we have:

$$a_{n+1}^{(n+1)} = -K_n,$$

$$a_k^{(n+1)} = a_k^{(n)} + K_n\, a_{n+1-k}^{(n)}, \qquad k = 1, 2, \ldots, n. \qquad \text{(B-25)}$$

Therefore, the solution for (3-17) is given recursively by (B-23),
(B-22), (B-18) and (B-25). A flow chart is given in Fig. B-1.

If the computations in (B-25) are to be done in place, one must
be careful not to destroy newly computed values as others are
computed. One solution is to compute $a_k^{(n+1)}$ and $a_{n+1-k}^{(n+1)}$ at the same
time since

$$a_{n+1-k}^{(n+1)} = a_{n+1-k}^{(n)} + K_n\, a_k^{(n)}.$$

Another method is to use an extra array $b_k$ where
$b_k = a_{n+1-k}^{(n)}$, $1 \le k \le n$;
then $a_k^{(n+1)} = a_k^{(n)} + K_n b_k$.

In Fig. B-1, AA is equal to $A_n^2$, the minimum total-squared
error, at every stage of the computation. Therefore, $AA/R_0$ is equal
to the normalized error $V_n$, which is discussed in Chapter V. If
the autocorrelation coefficients are normalized with respect to
$R_0$ before applying the algorithm in Fig. B-1, then AA will be
equal to the normalized error at every stage. Normalization of
the autocorrelation coefficients is especially recommended for
those who are using a computer with only integer arithmetic
capability.

The coefficients $K_n$ in (B-22) are the same as the partial
autocorrelation (PARCOR) coefficients of Itakura and Saito (1972).

Fig. B-1. Flow chart for the solution of the autocorrelation
normal equations (3-17). This is called the Fast Autocorrelation
method.

Since the minimum total-squared error ($E_n = A_n^2$) is always
positive, we conclude from (B-18) that $K_n$ must obey the relation:

$$|K_n| < 1. \qquad \text{(B-26)}$$
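The recursion defined by (B-23), (B-22), (B-25) and (B-18) can be sketched in a few lines. This is an illustrative reading of those equations rather than a transcription of the flow chart in Fig. B-1; the function name and the sample autocorrelation values are hypothetical. A new coefficient list is built at each stage, which sidesteps the in-place hazard discussed above at the cost of a temporary array.

```python
def fast_autocorrelation(r, p):
    """Solve the autocorrelation normal equations recursively.

    r holds R_0..R_p. Returns (a, err, ks): a[k-1] is the predictor
    coefficient a_k for order p, err is the minimum total-squared
    error, and ks lists K_0..K_{p-1} in the sign convention of (B-25).
    """
    a = []                     # order-0 predictor: H_0(z) = 1        (B-23)
    err = r[0]                 # A_0^2 = R_0                           (B-23)
    ks = []
    for n in range(p):
        num = r[n + 1] - sum(a[k] * r[n - k] for k in range(n))
        K = -num / err                                              # (B-22)
        # Build the new list from the old one so nothing is overwritten early.
        a = [a[k] + K * a[n - 1 - k] for k in range(n)] + [-K]      # (B-25)
        err *= 1.0 - K * K                                          # (B-18)
        ks.append(K)
    return a, err, ks

r = [4.0, 2.0, 1.5, 0.5]       # assumed positive-definite example
a, err, ks = fast_autocorrelation(r, 3)
```

For this example every `K` in `ks` has magnitude below one, as (B-26) requires, and `err / r[0]` is the normalized error at the final stage.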


B.3 Properties of $H_p(z)$


(a) From (B-11) and (B-4) we have:

$$\frac{T}{2\pi} \int_{-\pi/T}^{\pi/T} \frac{H_n(\omega)\, \overline{H_m(\omega)}}{A_n A_m}\, e^{j(n-m)\omega T}\, P(\omega)\, d\omega = \delta_{nm}, \qquad n, m = 0, 1, 2, \ldots \qquad \text{(B-27)}$$

$\{H_n(z)\}$ is a complete set of polynomials orthogonal on the unit
circle with $A_n$ as the normalizing factor for $H_n(z)$. It should be
remembered that (B-27) holds if and only if the coefficients $R_k$
in (B-3) are positive-definite (see Section 4.4). This is
guaranteed in the direct Autocorrelation method.

For $n = m = p$, (B-27) reduces to

$$\frac{T}{2\pi} \int_{-\pi/T}^{\pi/T} \frac{P(\omega)}{\hat{P}(\omega)}\, d\omega = 1, \qquad \text{(B-28)}$$

where $\hat{P}(\omega) = \dfrac{A_p^2}{|H_p(\omega)|^2} = |\hat{S}(\omega)|^2$ is the approximate spectrum.
Note that (B-28) is identical to (5-3) which was derived in a
different manner.

(b) The zeros of the orthogonal polynomials $H_p(z)$ all lie
inside the unit circle (Grenander and Szego, 1958, p. 40). In
other words, the inverse filter $H_p(z)$ is minimum-phase and the
all-pole filter $\hat{S}(z)$ is stable, as we have observed in Section 3.4.
Again, this is true iff the coefficients $R_k$ are positive-definite.
An equivalent necessary and sufficient condition is given by
(B-26). Another equivalent condition is that the minimum total-
squared error be positive.

(c) Since the system of orthogonal polynomials $H_p(z)$ is
complete, any polynomial in $z^{-1}$ of degree $p$ can be represented
as a linear summation of the polynomials $H_0(z), H_1(z), \ldots, H_p(z)$.
In other words, any recursive filter of degree $p$ can be realized
as a linear summation of minimum-phase recursive filters $H_n(z)$
with degrees $\le p$.

APPENDIX C

COMPUTATION  OF  SIGNAL  AND  APPROXIMATE  SPECTRA 

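The signal power spectrum in the direct Autocorrelation method
is given by:

$$P(\omega) = \left| \sum_{n=-\infty}^{\infty} s(nT)\, e^{-jn\omega T} \right|^2 \qquad \text{(C-1)}$$

where $s(nT)$ is the windowed signal.

The approximate or linear prediction spectrum $\hat{P}(\omega)$ can be
defined for all methods of linear prediction as:

$$\hat{P}(\omega) = \frac{A^2}{\left| 1 - \sum_{k=1}^{p} a_k e^{-jk\omega T} \right|^2} \qquad \text{(C-2)}$$

where $a_k$, $1 \le k \le p$, are the predictor coefficients and $A$ is the gain
factor.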


$P(\omega)$ and $\hat{P}(\omega)$ are both continuous, periodic, real and even
functions of frequency. The periodicity is equal to $1/T = f_s$, the
sampling frequency. Therefore, it is only necessary to compute
the spectra from zero frequency to a frequency of $f_s/2$. Also, it
is practical to compute the spectral values at only a finite
number of frequencies. One method of doing this is to use the
discrete Fourier transform (DFT) (Gold and Rader, 1969, Ch. 6) which
can be computed efficiently by fast Fourier transform techniques
(FFT) (Cochran, et al., 1967). Computation times for the FFT can
be cut approximately in half by using the fact that the signal
$s(nT)$ is real (see, for example, Makhoul, 1970b, Appendix B).
Therefore, $P(\omega)$ is computed at discrete frequency intervals by
taking the magnitude squared of the FFT of the signal $s(nT)$.
$\hat{P}(\omega)$ can be computed by dividing $A^2$ by the magnitude squared of
the FFT of the sequence: $1, -a_1, -a_2, \ldots, -a_p$. Arbitrary
resolution in the frequency domain can be obtained by simply appending
an appropriate number of zeros to the sequence whose FFT is
to be taken. If the number of zeros is large compared to the
length of the original sequence (as is normally the case in
computing $\hat{P}(\omega)$, where the number of frequency values desired is
much larger than $p$), the FFT algorithm can be pruned (Markel,
1971) resulting in a saving in computation. (Markel's algorithm
is based on a radix-2 FFT. We have implemented a radix-8 pruned
FFT which saves time only if the number of points in the FFT is
at  least  £  times  the  length  of  the  original  sequence.  For  exam¬ 
ple,  we  have  realized  a  saving  of  32%  over  the  regular  radix-8 
algorithm  by  computing  a  256-point  pruned  FFT  with  p  =  15.) 
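The zero-padding procedure described above can be sketched in a few
lines of Python. This is a minimal illustration, not the report's
implementation: the names `fft`, `lp_spectrum_fft` and `n_fft` are
hypothetical, the plain recursive radix-2 FFT stands in for any
efficient (or pruned) FFT routine, and the numerator of the spectrum
is assumed to be the squared gain $A^2$.

```python
import cmath


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of 2."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def lp_spectrum_fft(a, gain, n_fft):
    """Linear prediction spectrum at the frequencies k*fs/n_fft, 0 <= k <= n_fft/2.

    a     : predictor coefficients a_1 .. a_p
    gain  : gain factor A (numerator assumed to be A**2)
    n_fft : FFT length, a power of 2 no smaller than p + 1
    """
    h = [1.0] + [-ak for ak in a]      # sequence 1, -a_1, ..., -a_p
    h += [0.0] * (n_fft - len(h))      # appended zeros set the frequency resolution
    H = fft(h)
    # The spectrum is even, so only the values from 0 to fs/2 are kept.
    return [gain ** 2 / abs(H[k]) ** 2 for k in range(n_fft // 2 + 1)]
```

For a single-coefficient predictor with $a_1 = 0.9$ and unit gain,
`lp_spectrum_fft([0.9], 1.0, 8)` returns five values, the first of
which is $1/(1 - 0.9)^2 = 100$ up to rounding.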


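A more direct method of computing $\hat{P}(\omega)$ is obtained by noting
that (C-2) can be rewritten as follows:

$$\hat{P}(\omega) = \frac{A^2}{\rho_0 + 2 \sum_{k=1}^{p} \rho_k \cos(k\omega T)} \qquad \text{(C-3)}$$

where

$$\rho_0 = 1 + \sum_{k=1}^{p} a_k^2 \qquad \text{(C-4)}$$

and

$$\rho_k = -a_k + \sum_{n=1}^{p-k} a_n a_{n+k}, \quad 1 \le k \le p . \qquad \text{(C-5)}$$

The coefficients $\rho_k$ are just the autocorrelation coefficients
corresponding to the inverse filter $H(z) = 1 - \sum_{k=1}^{p} a_k z^{-k}$.
These coefficients need be computed only once for use in (C-3).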
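As a hedged sketch of this direct method, the Python below computes
the autocorrelation of the inverse-filter sequence
$1, -a_1, \ldots, -a_p$ once and then evaluates the spectrum at any
single frequency from a cosine sum. The function names, the name
`rho` for the autocorrelation coefficients, and the use of $A^2$ as
the numerator are assumptions made for illustration.

```python
import math


def inverse_filter_autocorr(a):
    """Autocorrelation rho_0 .. rho_p of the sequence 1, -a_1, ..., -a_p."""
    h = [1.0] + [-ak for ak in a]
    p = len(a)
    return [sum(h[n] * h[n + k] for n in range(p + 1 - k))
            for k in range(p + 1)]


def lp_spectrum_direct(rho, gain, omega_T):
    """Spectrum value at one normalized radian frequency omega*T.

    The denominator rho_0 + 2 * sum_k rho_k * cos(k * omega * T) equals the
    squared magnitude of the inverse filter on the unit circle.
    """
    denom = rho[0] + 2.0 * sum(rho[k] * math.cos(k * omega_T)
                               for k in range(1, len(rho)))
    return gain ** 2 / denom


# rho is computed once and reused for every frequency of interest.
rho = inverse_filter_autocorr([0.5, -0.3])
values = [lp_spectrum_direct(rho, 1.0, w) for w in (0.1, 0.35, 1.2, 3.0)]
```

Because `omega_T` is arbitrary, the frequencies need not be equally
spaced; the four values in the example lie on an uneven grid.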

If for every frequency of interest we know $\cos(\omega T)$, then
$\cos(k\omega T)$ can be computed recursively as follows:

$$\cos[(k+1)\omega T] = 2 \cos(\omega T) \cos(k\omega T) - \cos[(k-1)\omega T] .$$
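The recursion can be sketched as follows. The function name
`cos_multiples` is hypothetical; the point of the sketch is that,
once $2\cos(\omega T)$ is formed, each further cosine costs one
multiplication and one subtraction.

```python
import math


def cos_multiples(x, p):
    """Return [cos(0), cos(theta), ..., cos(p*theta)] given x = cos(theta)."""
    c = [1.0, x]                 # starting values cos(0) = 1 and cos(theta) = x
    two_x = 2.0 * x              # formed once per frequency
    for k in range(1, p):
        # cos((k+1)*theta) = 2*cos(theta)*cos(k*theta) - cos((k-1)*theta)
        c.append(two_x * c[k] - c[k - 1])
    return c[: p + 1]


# Example: the 15 cosines needed for p = 14 at theta = 0.3 radians.
cosines = cos_multiples(math.cos(0.3), 14)
```

No trigonometric function is evaluated inside the loop; only
$\cos(\omega T)$ itself has to be known for each frequency.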


Another way of looking at this is to note that if $\cos(\omega T) = x$,
then $\cos(k\omega T) = T_k(x)$, the Chebyshev polynomials. These
polynomials obey the recurrence relation:

$$T_{k+1}(x) = 2x\, T_k(x) - T_{k-1}(x),$$

with $T_0(x) = 1$ and $T_1(x) = x$. Therefore, given $2x$, $T_{k+1}(x)$
can be computed by a single multiplication and subtraction. If we
define a single computation as equal to a multiplication and an
addition (or subtraction), then if we desire $\hat{P}(\omega)$ at $M$
values of frequency, the total number of computations $C_d$ needed
is equal to:

$$C_d = \frac{p}{2}(p+3) + 2pM \qquad \text{(Direct Method)}$$

This is to be compared with

$$C_f = 2M(\log_2 M + 1) \qquad \text{(Simple FFT)}$$

for the base-2 regular FFT. For $p = 14$ and $M = 128$, $C_d/C_f = 1.9$.

$C_d$ can be cut approximately in half if each $\cos(k\omega T)$ is
already stored. However, we know that there exist algorithms which
cut $C_f$ by at least half. So, on the whole, the FFT is
approximately twice as fast as the direct method. But the efficient
FFT algorithms compute the transform at $M$ equidistant frequency
points, where $M$ is a power of 2. These restrictions do not apply
to the direct method. If one is interested in computing
$\hat{P}(\omega)$ along a nonlinear scale of frequencies, the direct
method may prove to be more efficient.
This is a list of most of the symbols used in this report,
along with the page number where that symbol is first used or
defined.
SYMBOL                        DESCRIPTION                                        PAGE

$a_k$                         Linear predictor coefficient                          3
$A$, $A_p$                    Gain factor of linear prediction transfer
                              function $\hat{S}(z)$                                17
$b_n$, $b(nT)$                Minimum-phase sequence corresponding to $s(nT)$      93
$B_n$                         Bandwidth of formant $n$                            191
$B(z)$                        z-transform of $b(nT)$                               93
$c_n$, $c(nT)$                Cepstrum of $s(nT)$                                  98
$\hat{c}_n$, $\hat{c}(nT)$    Cepstrum of $\hat{s}(nT)$                            98
$c'_n$, $c'(nT)$              Complex cepstrum of $s(nT)$                          98
$d_1$                         Differencing operator                               131
$D(z)$                        z-transform of differencing operator                131
$e_n$, $e(nT)$                Linear prediction error sequence                     31
$E$                           Total-squared error                                  31
$f_0$                         Inverse of window width                             179
$f_s$                         Sampling frequency                                   16
$F_0$                         Fundamental frequency                               141
$F_n$                         Frequency of formant $n$                            191
$H(z)$                        z-transform of linear prediction inverse filter      17
$p$                           Order of linear predictor                             3
$P(\omega)$                   Signal spectrum                                      54
$\hat{P}(\omega)$             Linear prediction or approximate spectrum            54
$P(\sigma_0, \omega)$         Off-axis spectrum                                   193
$P_e(\omega)$                 Error power spectrum                                 55
$P(\omega, t)$                Time-varying power spectrum                          76
$Q(\omega, \omega')$          Two-dimensional signal spectrum                      79
$Q_e(\omega, \omega')$        Two-dimensional spectrum of error signal             81
$r_k$                         Normalized autocorrelation of signal                 40
$R_k$, $R(kT)$                Autocorrelation of signal                            34
                              Autocorrelation of differenced signal               135
$\hat{R}_k$                   Autocorrelation of impulse response of
                              $\hat{S}(z)$                                         44
                              Apparent autocorrelation function                    65
$R(t, t+\tau)$                Nonstationary autocorrelation function               76
$s$                           Laplace operator                                     16
$s_n$, $s(nT)$                Signal to be analyzed                                 2
$s'_n$, $s'(nT)$              First difference of $s(nT)$                         130
$\hat{s}_n$, $\hat{s}(nT)$    Impulse response of $\hat{S}(z)$                     44
$S(z)$                        z-transform of $s(nT)$                               17
$\hat{S}(z)$                  Transfer function of discrete p-pole linear
                              prediction speech production model                   17
$\tilde{s}_n$, $\tilde{s}(nT)$  Linear prediction approximation to $s(nT)$         30
$T$                           Sampling interval                                     3
$T_i$                         Toeplitz form                                        71
$u_0(x)$                      Impulse function                                     78
$u_{-1}(x)$                   Step function                                       177
$u_n$, $u(nT)$                Excitation sequence for speech production model      17
$U(z)$                        z-transform of $u(nT)$                               17
$V$                           Ratio of spectral geometric mean to arithmetic
                              mean                                                115
$V_m$                         Lower bound on $V$                                  116
$V_{min}$                     $V_p$ for $p = \infty$                              104
$V_p$                         Normalized error                                     40
$w_n$, $w(nT)$                Discrete window function                             65
$w(t)$                        Continuous window function                          173
$W(f)$                        Fourier transform of $w(t)$                         173
$z$                           Complex variable of sampled-data frequency
                              domain                                               16
$\delta_{nm}$                 Kronecker delta                                      44
$\Gamma(\omega, \eta)$        Alternate two-dimensional spectrum                   76
$\sigma$                      Damping factor (real part of $s$)
$\tau$                        (1) Time lag for autocorrelation
                              (2) Pitch period
                              Window width
                              Effective window width
                              Covariance coefficient
                              Polynomials orthogonal on the unit circle
$\omega$                      Radian frequency (imaginary part of $s$)
$\omega_s$                    Radian sampling frequency
$\omega'$                     Radian frequency in 2D-spectrum
                              Radian frequency in alternate 2D-spectrum